Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Partial Match or RegEx in Search Console's URL Parameters Tool?
-
So I currently have approximately 1000 of these URLs indexed, when I only want roughly 100 of them.
Let's say the URL is www.example.com/page.php?par1=ABC123=&par2=DEF456=&par3=GHI789=
All the indexed URLs follow that same kinda format, but I only want to index the URLs that have a par1 of ABC (but that could be ABC123 or ABC456 or whatever). Using URL Parameters tool in Search Console, I can ask Googlebot to only crawl URLs with a specific value. But is there any way to get a partial match, using regex maybe?
Am I wasting my time with Search Console, and should I just disallow any page.php without par1=ABC in robots.txt?
-
No problem
Hope you get it sorted!
-Andy
-
Thank you!
-
Haha, I think the train passed the station on that one. I would have realised eventually... XD
Thanks for your help!
-
Don't forget that . & ? have a specific meaning within regex - if you want to use them for pattern matching you will have to escape them. Also be aware that not all bots are capable of interpreting regex in robots.txt - you might want to be more explicit on the user agent - only using regex for Google bot.
User-agent: Googlebot
#disallowing page.php and any parameters after it
disallow: /page.php
#but leaving anything that starts with par1=ABC
allow: page.php?par1=ABC
Dirk
-
Ah sorry I missed that bit!
-Andy
-
Disallowing them would be my first priority really, before removing from index.
The trouble with this is that if you disallow first, Google won't be able to crawl the page to act on the noindex. If you add a noindex flag, Google won't index them the next time it comes-a-crawling and then you will be good to disallow
I'm not actually sure of the best way for you to get the noindex in to the page header of those pages though.
-Andy
-
Yep, have done. (Briefly mentioned in my previous response.) Doesn't pass
-
I thought so too, but according to Google the trailing wildcard is completely unnecessary, and only needs to be used mid-URL.
-
Hi Andy,
Disallowing them would be my first priority really, before removing from index. Didn't want to remove them before I've blocked Google from crawling them in case they get added back again next time Google comes a-crawling, as has happened before when I've simply removed a URL here and there. Does that make sense or am I getting myself mixed up here?
My other hack of a solution would be to check the URL in the page.php, and if URL includes par1=ABC then insert noindex meta tag. (Not sure if that would work well or not...)
-
My guess would be that this line needs an * at the end.
Allow: /page.php?par1=ABC* -
Sorry Martijn, just to jump in here for a second - Ria, you can test this via the Robots.txt testing tool in search console before going live to make sure it work.
-Andy
-
Hi Martijn, thanks for your response!
I'm currently looking at something like this...
**user-agent: *** #disallowing page.php and any parameters after it
disallow: /page.php #but leaving anything that starts with par1=ABC
allow: /page.php?par1=ABCI would have thought that you could disallow things broadly like that and give an exception, as you can with files in disallowed folders. But it's not passing Google's robots.txt Tester.
One thing that's probably worth mentioning really is that there are only two variables that I want to allow of the par1 parameter. For example's sake, ABC123 and ABC456. So would need to be either a partial match or "this or that" kinda deal, disallowing everything else.
-
Hi Ria,
I have never tried regular expressions in this way, so I can't tell you if this would work or not.
However, If all 1000 of these URL's are already indexed, just disallowing access won't then remove them from Google. You would ideally be able to place a noindex tag on those pages and let Google act on them, then you will be good to disallow. I am pretty sure there is no option to noindex under the URL Parameter Tool.
I hope that makes sense?
-Andy
-
Hi Ria,
What you could do, but it also depends on the rest of your structure is Disallow these urls based on the parameters (what you could do in a worst case scenario is that you would disallow all URLs and then put an exception Allow in there as well to make sure you still have the right URLs being indexed).
Martijn.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
After hack and remediation, thousands of URL's still appearing as 'Valid' in google search console. How to remedy?
I'm working on a site that was hacked in March 2019 and in the process, nearly 900,000 spam links were generated and indexed. After remediation of the hack in April 2019, the spammy URLs began dropping out of the index until last week, when Search Console showed around 8,000 as "Indexed, not submitted in sitemap" but listed as "Valid" in the coverage report and many of them are still hack-related URLs that are listed as being indexed in March 2019, despite the fact that clicking on them leads to a 404. As of this Saturday, the number jumped up to 18,000, but I have no way of finding out using the search console reports why the jump happened or what are the new URLs that were added, the only sort mechanism is last crawled and they don't show up there. How long can I expect it to take for these remaining urls to also be removed from the index? Is there any way to expedite the process? I've submitted a 'new' sitemap several times, which (so far) has not helped. Is there any way to see inside the new GSC view why/how the number of valid URLs in the indexed doubled over one weekend?
Intermediate & Advanced SEO | | rickyporco0 -
Forwarded vanity domains, suddenly resolving to 404 with appended URL's ending in random 5 characters
We have several vanity domains that forward to various pages on our primary domain.
Intermediate & Advanced SEO | | SS.Digital
e.g. www.vanity.com (301)--> www.mydomain.com/sub-page (200) These forwards have been in place for months or even years and have worked fine. As of yesterday, we have seen the following problem. We have made no changes in the forwarding settings. Now, inconsistently, they sometimes resolve and sometimes they do not. When we load the vanity URL with Chrome Dev Tools (Network Pane) open, it shows the following redirect chains, where xxxxx represents a random 5 character string of lower and upper case letters. (e.g. VGuTD) EXAMPLE:
www.vanity.com (302, Found) -->
www.vanity.com/xxxxx (302, Found) -->
www.vanity.com/xxxxx (302, Found) -->
www.vanity.com/xxxxx/xxxxx (302, Found) -->
www.mydomain.com/sub-page/xxxxx (404, Not Found) This is just one example, the amount of redirects, vary wildly. Sometimes there is only 1 redirect, sometimes there are as many as 5. Sometimes the request will ultimately resolve on the correct mydomain.com/sub-page, but usually it does not (as in the example above). We have cross-checked across every browser, device, private/non-private, cookies cleared, on and off of our network etc... This leads us to believe that it is not at the device or host level. Our Registrar is Godaddy. They have not encountered this issue before, and have no idea what this 5 character string is from. I tend to believe them because per our analytics, we have determined that this problem only started yesterday. Our primary question is, has anybody else encountered this problem either in the last couple days, or at any time in the past? We have come up with a solution that works to alleviate the problem, but to implement it across hundreds of vanity domains will take us an inordinate amount of time. Really hoping to fix the cause of the problem instead of just treating the symptom.0 -
Site-wide Canonical Rewrite Rule for Multiple Currency URL Parameters?
Hi Guys, I am currently working with an eCommerce site which has site-wide duplicate content caused by currency URL parameter variations. Example: https://www.marcb.com/ https://www.marcb.com/?setCurrencyId=3 https://www.marcb.com/?setCurrencyId=2 https://www.marcb.com/?setCurrencyId=1 My initial thought is to create a bunch of canonical tags which will pass on link equity to the core URL version. However I was wondering if there was a rule which could be implemented within the .htaccess file that will make the canonical site-wide without being so labour intensive. I also noticed that these URLs are being indexed in Google, so would it be worth setting a site-wide noindex to these variations also? Thanks
Intermediate & Advanced SEO | | NickG-1230 -
Crawled page count in Search console
Hi Guys, I'm working on a project (premium-hookahs.nl) where I stumble upon a situation I can’t address. Attached is a screenshot of the crawled pages in Search Console. History: Doing to technical difficulties this webshop didn’t always no index filterpages resulting in thousands of duplicated pages. In reality this webshops has less than 1000 individual pages. At this point we took the following steps to result this: Noindex filterpages. Exclude those filterspages in Search Console and robots.txt. Canonical the filterpages to the relevant categoriepages. This however didn’t result in Google crawling less pages. Although the implementation wasn’t always sound (technical problems during updates) I’m sure this setup has been the same for the last two weeks. Personally I expected a drop of crawled pages but they are still sky high. Can’t imagine Google visits this site 40 times a day. To complicate the situation: We’re running an experiment to gain positions on around 250 long term searches. A few filters will be indexed (size, color, number of hoses and flavors) and three of them can be combined. This results in around 250 extra pages. Meta titles, descriptions, h1 and texts are unique as well. Questions: - Excluding in robots.txt should result in Google not crawling those pages right? - Is this number of crawled pages normal for a website with around 1000 unique pages? - What am I missing? BxlESTT
Intermediate & Advanced SEO | | Bob_van_Biezen0 -
Does Google Read URL's if they include a # tag? Re: SEO Value of Clean Url's
An ECWID rep stated in regards to an inquiry about how the ECWID url's are not customizable, that "an important thing is that it doesn't matter what these URLs look like, because search engines don't read anything after that # in URLs. " Example http://www.runningboards4less.com/general-motors#!/Classic-Pro-Series-Extruded-2/p/28043025/category=6593891 Basically all of this: #!/Classic-Pro-Series-Extruded-2/p/28043025/category=6593891 That is a snippet out of a conversation where ECWID said that dirty urls don't matter beyond a hashtag... Is that true? I haven't found any rule that Google or other search engines (Google is really the most important) don't index, read, or place value on the part of the url after a # tag.
Intermediate & Advanced SEO | | Atlanta-SMO0 -
Using the same content on different TLD's
HI Everyone, We have clients for whom we are going to work with in different countries but sometimes with the same language. For example we might have a client in a competitive niche working in Germany, Austria and Switzerland (Swiss German) ie we're going to potentially rewrite our website three times in German, We're thinking of using Google's href lang tags and use pretty much the same content - is this a safe option, has anyone actually tries this successfully or otherwise? All answers appreciated. Cheers, Mel.
Intermediate & Advanced SEO | | dancape1 -
Url structure for multiple search filters applied to products
We have a product catalog with several hundred similar products. Our list of products allows you apply filters to hone your search, so that in fact there are over 150,000 different individual searches you could come up with on this page. Some of these searches are relevant to our SEO strategy, but most are not. Right now (for the most part) we save the state of each search with the fragment of the URL, or in other words in a way that isn't indexed by the search engines. The URL (without hashes) ranks very well in Google for our one main keyword. At the moment, Google doesn't recognize the variety of content possible on this page. An example is: http://www.example.com/main-keyword.html#style=vintage&color=blue&season=spring We're moving towards a more indexable URL structure and one that could potentially save the state of all 150,000 searches in a way that Google could read. An example would be: http://www.example.com/main-keyword/vintage/blue/spring/ I worry, though, that giving so many options in our URL will confuse Google and make a lot of duplicate content. After all, we only have a few hundred products and inevitably many of the searches will look pretty similar. Also, I worry about losing ground on the main http://www.example.com/main-keyword.html page, when it's ranking so well at the moment. So I guess the questions are: Is there such a think as having URLs be too specific? Should we noindex or set rel=canonical on the pages whose keywords are nested too deep? Will our main keyword's page suffer when it has to share all the inbound links with these other, more specific searches?
Intermediate & Advanced SEO | | boxcarpress0 -
To subnav or NOT to subnav... that's my question.... :)
We are working on a new website that is golf related and wondering about whether or not we should set up a subnavigation dropdown menu from the main menu. For example: GOLF PACKAGES
Intermediate & Advanced SEO | | JamesO
>> 2 Round Packages
>> 3 Round Packages
>> 4 Round Packages
>> 5 Round Packages GOLF COURSES
>> North End Courses
>> Central Courses
>> South End Courses This would actually be very beneficial to our users from a usability standpoint, BUT what about from an SEO standpoint? Is diverting all the link juice to these inner pages from the main site navigation harmful? Should we just create a page for GOLF PACKAGES and break it down on that page?0