Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Googlebot does not obey robots.txt disallow
-
Hi Mozzers!
We are trying to get Googlebot to steer away from our internal search results pages by adding a parameter "nocrawl=1" to facet/filter links and then robots.txt disallow all URLs containing that parameter.
We implemented this late august and since that, the GWMT message "Googlebot found an extremely high number of URLs on your site", stopped coming.
But today we received yet another. The weird thing is that Google gives many of our nowadays robots.txt disallowed URLs as examples of URLs that may cause us problems.
What could be the reason?
Best regards,
Martin
-
Sorry for the late reply. Feel free to send me a PM. (not sure I can help, but more than happy to take a look)
-
We do not currently have any sanitation rules in order to maintain the nocrawl param. But that is a good point. 301:ing will be difficult for us but I will definitely add the nocrawl param to the rel canonical of those internal SERPs.
-
Thank you, Igol. I will definitely look into your first suggestion.
-
Thank you, Cyrus.
This is what it looks like:
User-agent: *
Disallow: /nocrawl=1The weird thing is that when testing one of the sample URLs (given by Google as "problematic" in the GWMT message and that contains the nocrawl param) on the GWMT "Blocked URLs" page by entering the contents of our robots.txt and the sample URL, Google says crawling of the URL is disallowed for Googlebot.
On the top of the same page, it says "Never" under the heading "Fetched when" (translated from Swedish..). But when i "Fetch as Google" our robots.txt, Googlebot has no problems fetching it. So i guess the "Never" information is due to a GWMT bug?
I also tested our robots.txt against your recommended service http://www.frobee.com/robots-txt-check. It says all robots has access to the sample URL above, but I gather the tool is not wildcard-savvy.
I will not disclose our domain in this context, please tell me if it is ok to send you a PW.
About the noindex stuff. Basically, the nocrawl param is added to internal links pointing to internal search result pages filtered by more than two params. Although we allow crawling of less complicated internal serps, we disallow indexing of most of them by "meta noindex".
-
Thanks.
100% agree with the Meta Noindex suggestion.
-
It can be tricky blocking parameters with robots.txt. The first thing you want to do is make sure your are actually blocking the URLs. There are a few good robots.txt checkers out there that can help:
You're file is probably going to look something like:
User-agent: *
Disallow: /*?nocrawl=1... but this could vary depending on exactly you don't want crawled
+1 to Igal's suggestion of handling these via parameter settings in Google Webmaster Tools: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
Finally, if your goal is to keep search results out of the index (it probably should be) then you should also highly consider using a meta robots NOINDEX tag on all search results pages. You can also slap a nofollow on links pointing to search results as this might also help Google steer clear of those pages.
Best of luck!
Edit: Here's what John Wu of Google Webmaster has to say...
"We show this warning when we find a high number of URLs on a site -- even before we attempt to crawl them. If you are blocking them with a robots.txt file, that's generally fine. If you really do have a high number of URLs on your site, you can generally ignore this message. If your site is otherwise small and we find a high number of URLs, then this kind of message can help you to fix any issues (or disallow access) before we start to access your server to check gazillions of URLs :-)."
-
Didn't say it wasn't.
I`m just not sure how these rules apply to parameters, since they are not a part of the "core" URL.
(For example: What happens if I take a URL from your site, change a nocrawl=1 to nocrawl=0 and link to it from mine?
Do you have any URL sanitation rules in place to overcome that or will the page be indexed by Googlebot when it crawls my site and moves on to yours?)Personally, when dealing with parameters, I find it easier to work with WMT so I was offering an easier workaround, (at least for me)
To tell you the truth, I would use hard-coded on page meta noindex/nofollow here (again, as parameters can be so easily manipulated).
-
Igal, thank your for replying.
But robots.txt disallowing URLs by matching patterns has been supported by Googlebot for a long time now.
-
Hi
I`m not sure if this is the best way to go about it.
Robots.txt is commonly used for folder level disallow rules, I`m not sure how it will respond to parameters.
Having said that, there are several things you can do here:
1. You can use WMT to zero in on this parameter and prevent it from being searched.
To do so choose Configuration>>URL Parameters, answer "Yes" to the question about content change and
check-in the 3rd bullet (Only URL with value...) Of course you'll need to choose "1" as the right value.2. If this still didn't solve your issue, you might want to try using htacess + regex to prevent access by user agent.
You can find user-agent information here Googlebot user agent listAlso, you may want to check my blog post about some of the less known Googlebot Facts (shameless self-promotion)
Best
Igal
-
I'll send you a PW, Des.
-
What the domain.?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Multiple robots.txt files on server
Hi! I have previously hired a developer to put up my site and noticed afterwards that he did not know much about SEO. This lead me to starting to learn myself and applying some changes step by step. One of the things I am currently doing is inserting sitemap reference in robots.txt file (which was not there before). But just now when I wanted to upload the file via FTP to my server I found multiple ones - in different sizes - and I dont know what to do with them? Can I remove them? I have downloaded and opened them and they seem to be 2 textfiles and 2 dupplicates. Names: robots.txt (original dupplicate)
Technical SEO | | mjukhud
robots.txt-Original (original)
robots.txt-NEW (other content)
robots.txt-Working (other content dupplicate) Would really appreciate help and expertise suggestions. Thanks!0 -
Robots txt. in page with 301 redirect
We currently have a a series of help pages that we would like to disallow from our robots txt. The thing is that these help pages are located in our old website, which now has a 301 redirect to current site. Which is the proper way to go around? 1- Add the pages we want to disallow to the robots.txt of the new website? 2- Break the redirect momentarily and add the pages to the robots.txt of the old one? Thanks
Technical SEO | | Kilgray0 -
Google indexing despite robots.txt block
Hi This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt Any clues why this is or what I could do to resolve it? Thanks!
Technical SEO | | zeepartner0 -
Guys & Gals anyone know if urllist.txt is still used?
I'm using a tool which generates urllist.txt and looking on the SEO Forums it seems that Yahoo used to use this. What I'd like to know is is it still used anywhere and should we have it on the site?
Technical SEO | | danwebman0 -
Robots.txt to disallow /index.php/ path
Hi SEOmoz, I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?. I don't use that extension, but would it cause me any problems from an SEO perspective? How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
Technical SEO | | Mikkehl0 -
No indexing url including query string with Robots txt
Dear all, how can I block url/pages with query strings like page.html?dir=asc&order=name with robots txt? Thanks!
Technical SEO | | HMK-NL0 -
OK to block /js/ folder using robots.txt?
I know Matt Cutts suggestions we allow bots to crawl css and javascript folders (http://www.youtube.com/watch?v=PNEipHjsEPU) But what if you have lots and lots of JS and you dont want to waste precious crawl resources? Also, as we update and improve the javascript on our site, we iterate the version number ?v=1.1... 1.2... 1.3... etc. And the legacy versions show up in Google Webmaster Tools as 404s. For example: http://www.discoverafrica.com/js/global_functions.js?v=1.1
Technical SEO | | AndreVanKets
http://www.discoverafrica.com/js/jquery.cookie.js?v=1.1
http://www.discoverafrica.com/js/global.js?v=1.2
http://www.discoverafrica.com/js/jquery.validate.min.js?v=1.1
http://www.discoverafrica.com/js/json2.js?v=1.1 Wouldn't it just be easier to prevent Googlebot from crawling the js folder altogether? Isn't that what robots.txt was made for? Just to be clear - we are NOT doing any sneaky redirects or other dodgy javascript hacks. We're just trying to power our content and UX elegantly with javascript. What do you guys say: Obey Matt? Or run the javascript gauntlet?0 -
Invisible robots.txt?
So here's a weird one... Client comes to me for some simple changes, turns out there are some major issues with the site, one of which is that none of the correct content pages are showing up in Google, just ancillary (outdated) ones. Looks like an issue because even the main homepage isn't showing up with a "site:domain.com" So, I add to Webmaster Tools and, after an hour or so, I get the red bar of doom, "robots.txt is blocking important pages." I check it out in Webmasters and, sure enough, it's a "User agent: * Disallow /" ACK! But wait... there's no robots.txt to be found on the server. I can go to domain.com/robots.txt and see it but nothing via FTP. I upload a new one and, thankfully, that is now showing but I've never seen that before. Question is: can a robots.txt file be stored in a way that can't be seen? Thanks!
Technical SEO | | joshcanhelp0