Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Googlebot does not obey robots.txt disallow
- 
					
					
					
					
 Hi Mozzers! We are trying to get Googlebot to steer away from our internal search results pages by adding a parameter "nocrawl=1" to facet/filter links and then disallowing all URLs containing that parameter in robots.txt. We implemented this in late August, and since then the GWMT message "Googlebot found an extremely high number of URLs on your site" stopped coming. But today we received yet another one. The weird thing is that Google lists many of our now robots.txt-disallowed URLs as examples of URLs that may cause us problems. What could be the reason? Best regards, Martin 
- 
					
					
					
					
 Sorry for the late reply. Feel free to send me a PM. (not sure I can help, but more than happy to take a look) 
- 
					
					
					
					
 We do not currently have any sanitization rules in place to maintain the nocrawl param, but that is a good point. 301 redirecting will be difficult for us, but I will definitely add the nocrawl param to the rel canonical of those internal SERPs. 
- 
					
					
					
					
 Thank you, Igal. I will definitely look into your first suggestion. 
- 
					
					
					
					
 Thank you, Cyrus. This is what it looks like:
 User-agent: *
 Disallow: /nocrawl=1
 The weird thing is that when I test one of the sample URLs (given by Google as "problematic" in the GWMT message, and containing the nocrawl param) on the GWMT "Blocked URLs" page by entering the contents of our robots.txt and the sample URL, Google says crawling of the URL is disallowed for Googlebot. At the top of the same page, it says "Never" under the heading "Fetched when" (translated from Swedish). But when I "Fetch as Google" our robots.txt, Googlebot has no problems fetching it. So I guess the "Never" information is due to a GWMT bug?
 I also tested our robots.txt against your recommended service, http://www.frobee.com/robots-txt-check. It says all robots have access to the sample URL above, but I gather the tool is not wildcard-savvy.
 I will not disclose our domain in this context; please tell me if it is OK to send you a PM.
 About the noindex stuff: basically, the nocrawl param is added to internal links pointing to internal search result pages filtered by more than two params. Although we allow crawling of less complicated internal SERPs, we disallow indexing of most of them with "meta noindex". 
- 
					
					
					
					
 Thanks. 100% agree with the Meta Noindex suggestion. 
- 
					
					
					
					
 It can be tricky blocking parameters with robots.txt. The first thing you want to do is make sure you are actually blocking the URLs. There are a few good robots.txt checkers out there that can help.
 Your file is probably going to look something like:
 User-agent: *
 Disallow: /*?nocrawl=1
 ... but this could vary depending on exactly what you don't want crawled.
 +1 to Igal's suggestion of handling these via parameter settings in Google Webmaster Tools: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
 Finally, if your goal is to keep search results out of the index (it probably should be), then you should also strongly consider using a meta robots NOINDEX tag on all search results pages. You can also slap a nofollow on links pointing to search results, as this might also help Google steer clear of those pages. Best of luck!
 Edit: Here's what John Mueller of Google has to say... "We show this warning when we find a high number of URLs on a site -- even before we attempt to crawl them. If you are blocking them with a robots.txt file, that's generally fine. If you really do have a high number of URLs on your site, you can generally ignore this message. If your site is otherwise small and we find a high number of URLs, then this kind of message can help you to fix any issues (or disallow access) before we start to access your server to check gazillions of URLs :-)." 
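 To illustrate why the exact pattern matters, here is a minimal Python sketch of wildcard matching for a single Disallow rule, assuming the matching behaviour Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is matched as a prefix from the start of the path). The function names `rule_to_regex` and `is_disallowed` and the sample URLs are made up for illustration; this is not Googlebot's actual implementation, and it ignores Allow rules and rule precedence.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate one robots.txt path rule into a compiled regex.

    '*' becomes '.*', a trailing '$' anchors the end of the URL,
    and every other character is treated literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, rule: str) -> bool:
    """Return True if the rule matches the path (plus query string).

    re.match anchors at the start of the string, which gives the
    prefix-matching behaviour of robots.txt rules.
    """
    return rule_to_regex(rule).match(path_and_query) is not None


# nocrawl is the first parameter: the '?'-based rule matches.
print(is_disallowed("/search?nocrawl=1&color=red", "/*?nocrawl=1"))  # True
# nocrawl is NOT the first parameter: the '?'-based rule misses it.
print(is_disallowed("/search?color=red&nocrawl=1", "/*?nocrawl=1"))  # False
# A rule without the '?' matches the parameter in any position.
print(is_disallowed("/search?color=red&nocrawl=1", "/*nocrawl=1"))   # True
# Without a wildcard, the rule only matches paths starting with it.
print(is_disallowed("/search?color=red&nocrawl=1", "/nocrawl=1"))    # False
```

 If the parameter can appear anywhere in the query string, a rule like `/*nocrawl=1` is the safer of the two in this sketch, at the cost of also matching values such as `nocrawl=10`.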
- 
					
					
					
					
 Didn't say it wasn't. I'm just not sure how these rules apply to parameters, since they are not part of the "core" URL. (For example: what happens if I take a URL from your site, change nocrawl=1 to nocrawl=0 and link to it from mine? Do you have any URL sanitization rules in place to overcome that, or will the page be indexed by Googlebot when it crawls my site and moves on to yours?)
 Personally, when dealing with parameters, I find it easier to work with WMT, so I was offering an easier workaround (at least for me). To tell you the truth, I would use a hard-coded on-page meta noindex/nofollow here (again, as parameters can be so easily manipulated). 
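 As a rough sketch of the hard-coded approach, the page itself could decide its robots meta tag from the filters actually applied, so that flipping nocrawl=1 to nocrawl=0 in an external link changes nothing. Everything here is an assumption for illustration: the function name `robots_meta`, the `CONTROL_PARAMS` set, the sample URLs and the threshold of two filters (borrowed from Martin's description of when the nocrawl param is added) are hypothetical, not the site's real logic.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical: parameters that carry no filtering meaning and are ignored.
CONTROL_PARAMS = {"nocrawl"}


def robots_meta(url: str, max_filters: int = 2) -> str:
    """Return a meta robots tag based on how many filters the URL applies.

    The value of the nocrawl parameter is deliberately ignored, so it
    cannot be manipulated from outside to make a page indexable.
    """
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    filters = [name for name, _ in params if name not in CONTROL_PARAMS]
    if len(filters) > max_filters:
        return '<meta name="robots" content="noindex, nofollow">'
    return '<meta name="robots" content="index, follow">'


# Three filters: noindex regardless of the nocrawl value.
print(robots_meta("https://example.com/search?color=red&size=m&brand=x&nocrawl=0"))
# One filter: indexable in this sketch.
print(robots_meta("https://example.com/search?color=red"))
```

 Note that Googlebot can only see a meta noindex on pages it is allowed to crawl, so this does not combine with a robots.txt disallow on the same URLs.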
- 
					
					
					
					
 Igal, thank you for replying. But disallowing URLs in robots.txt by pattern matching has been supported by Googlebot for a long time now. 
- 
					
					
					
					
 Hi, I'm not sure if this is the best way to go about it. Robots.txt is commonly used for folder-level disallow rules; I'm not sure how it will respond to parameters. Having said that, there are several things you can do here:
 1. You can use WMT to zero in on this parameter and prevent it from being crawled. To do so, choose Configuration >> URL Parameters, answer "Yes" to the question about content change and select the third option (Only URLs with value...). Of course, you'll need to choose "1" as the value.
 2. If this still doesn't solve your issue, you might want to try using .htaccess + regex to prevent access by user agent. You can find user-agent information in the Googlebot user agent list.
 Also, you may want to check my blog post about some of the lesser-known Googlebot facts (shameless self-promotion). Best, Igal 
- 
					
					
					
					
 I'll send you a PM, Des. 
- 
					
					
					
					
 What's the domain? 