Googlebot does not obey robots.txt disallow
-
Hi Mozzers!
We are trying to get Googlebot to steer away from our internal search results pages by adding a parameter "nocrawl=1" to facet/filter links and then robots.txt disallow all URLs containing that parameter.
We implemented this late august and since that, the GWMT message "Googlebot found an extremely high number of URLs on your site", stopped coming.
But today we received yet another. The weird thing is that Google gives many of our nowadays robots.txt disallowed URLs as examples of URLs that may cause us problems.
What could be the reason?
Best regards,
Martin
-
Sorry for the late reply. Feel free to send me a PM. (not sure I can help, but more than happy to take a look)
-
We do not currently have any sanitation rules in order to maintain the nocrawl param. But that is a good point. 301:ing will be difficult for us but I will definitely add the nocrawl param to the rel canonical of those internal SERPs.
-
Thank you, Igol. I will definitely look into your first suggestion.
-
Thank you, Cyrus.
This is what it looks like:
User-agent: *
Disallow: /nocrawl=1The weird thing is that when testing one of the sample URLs (given by Google as "problematic" in the GWMT message and that contains the nocrawl param) on the GWMT "Blocked URLs" page by entering the contents of our robots.txt and the sample URL, Google says crawling of the URL is disallowed for Googlebot.
On the top of the same page, it says "Never" under the heading "Fetched when" (translated from Swedish..). But when i "Fetch as Google" our robots.txt, Googlebot has no problems fetching it. So i guess the "Never" information is due to a GWMT bug?
I also tested our robots.txt against your recommended service http://www.frobee.com/robots-txt-check. It says all robots has access to the sample URL above, but I gather the tool is not wildcard-savvy.
I will not disclose our domain in this context, please tell me if it is ok to send you a PW.
About the noindex stuff. Basically, the nocrawl param is added to internal links pointing to internal search result pages filtered by more than two params. Although we allow crawling of less complicated internal serps, we disallow indexing of most of them by "meta noindex".
-
Thanks.
100% agree with the Meta Noindex suggestion.
-
It can be tricky blocking parameters with robots.txt. The first thing you want to do is make sure your are actually blocking the URLs. There are a few good robots.txt checkers out there that can help:
You're file is probably going to look something like:
User-agent: *
Disallow: /*?nocrawl=1... but this could vary depending on exactly you don't want crawled
+1 to Igal's suggestion of handling these via parameter settings in Google Webmaster Tools: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
Finally, if your goal is to keep search results out of the index (it probably should be) then you should also highly consider using a meta robots NOINDEX tag on all search results pages. You can also slap a nofollow on links pointing to search results as this might also help Google steer clear of those pages.
Best of luck!
Edit: Here's what John Wu of Google Webmaster has to say...
"We show this warning when we find a high number of URLs on a site -- even before we attempt to crawl them. If you are blocking them with a robots.txt file, that's generally fine. If you really do have a high number of URLs on your site, you can generally ignore this message. If your site is otherwise small and we find a high number of URLs, then this kind of message can help you to fix any issues (or disallow access) before we start to access your server to check gazillions of URLs :-)."
-
Didn't say it wasn't.
I`m just not sure how these rules apply to parameters, since they are not a part of the "core" URL.
(For example: What happens if I take a URL from your site, change a nocrawl=1 to nocrawl=0 and link to it from mine?
Do you have any URL sanitation rules in place to overcome that or will the page be indexed by Googlebot when it crawls my site and moves on to yours?)Personally, when dealing with parameters, I find it easier to work with WMT so I was offering an easier workaround, (at least for me)
To tell you the truth, I would use hard-coded on page meta noindex/nofollow here (again, as parameters can be so easily manipulated).
-
Igal, thank your for replying.
But robots.txt disallowing URLs by matching patterns has been supported by Googlebot for a long time now.
-
Hi
I`m not sure if this is the best way to go about it.
Robots.txt is commonly used for folder level disallow rules, I`m not sure how it will respond to parameters.
Having said that, there are several things you can do here:
1. You can use WMT to zero in on this parameter and prevent it from being searched.
To do so choose Configuration>>URL Parameters, answer "Yes" to the question about content change and
check-in the 3rd bullet (Only URL with value...) Of course you'll need to choose "1" as the right value.2. If this still didn't solve your issue, you might want to try using htacess + regex to prevent access by user agent.
You can find user-agent information here Googlebot user agent listAlso, you may want to check my blog post about some of the less known Googlebot Facts (shameless self-promotion)
Best
Igal
-
I'll send you a PW, Des.
-
What the domain.?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
RegEx help needed for robots.txt potential conflict
I've created a robots.txt file for a new Magento install and used an existing site-map that was on the Magento help forums but the trouble is I can't decipher something. It seems that I am allowing and disallowing access to the same expression for pagination. My robots.txt file (and a lot of other Magento site-maps it seems) includes both: Allow: /*?p= and Disallow: /?p=& I've searched for help on RegEx and I can't see what "&" does but it seems to me that I'm allowing crawler access to all pagination URLs, but then possibly disallowing access to all pagination URLs that include anything other than just the page number? I've looked at several resources and there is practically no reference to what "&" does... Can anyone shed any light on this, to ensure I am allowing suitable access to a shop? Thanks in advance for any assistance
Technical SEO | | MSTJames0 -
Robots.txt issue - site resubmission needed?
We recently had an issue when a load of new files were transferred from our dev server to the live site, which unfortunately included the dev site's robots.txt file which had a disallow:/ instruction. Bad! Luckily I spotted it quickly and the file has been replaced. The extent of the damage seems to be that some descriptions aren't displaying and we're getting a message about robots.txt in the SERPs for a few keywords. I've done a site: search and generally it seems to be OK for 99% of our pages. Our positions don't seem to be affected right now but obviously it's not great for the CTRs on those keywords affected. My question is whether there is anything I can do to bring the updated robots.txt file to Google's attention? Or should we just wait and sit it out? Thanks in advance for your answers!
Technical SEO | | GBC0 -
Why is either Rogerbot or (if it is the case) Googlebots not recognizing keyword usage in my body text?
I have a client that does liposuction as one of their main services, they have been ranked in the top 1-5 for their keywords "sarasota liposuction" with different variations of the words for a long time, and suddenly have dropped about 10-12 places down to #15 in the engine. I went to investigate this and actually came to the "on-page analysis" tool for SEOmoz pro, where oddly enough it says that there is no mention of the target keyword in the body content (on-page analysis tool screenshot attached). I didn't quite understand why it would not recognize the obvious keywords in the body text so I went back to the page and inspected further. The keywords have an odd featured link that links up to an internally hosted keyword glossary for definitions of terms that people might not know directly. These definitions pop up in a lightbox upon clicking the keyword (liposuction lightbox screenshots attached). I have no idea why google would not recognize these words as they have the text in between the link, yet if there is something wrong with the code syntax etc. it might possibly hender the engine from seeing the body text of the link? any help would be greatly appreciated! Thank you so much! Phn2m Phn2m.png bWr5K.png V36CL.png
Technical SEO | | jbster130 -
Help needed with robots.txt regarding wordpress!
Here is my robots.txt from google webmaster tools. These are the pages that are being blocked and I am not sure which of these to get rid of in order to unblock blog posts from being searched. http://ensoplastics.com/theblog/?cat=743 http://ensoplastics.com/theblog/?p=240 These category pages and blog posts are blocked so do I delete the /? ...I am new to SEO and web development so I am not sure why the developer of this robots.txt file would block pages and posts in wordpress. It seems to me like that is the reason why someone has a blog so it can be searched and get more exposure for SEO purposes. IS there a reason I should block any pages contained in wodrpress? Sitemap: http://www.ensobottles.com/blog/sitemap.xml User-agent: Googlebot Disallow: /*/trackback Disallow: /*/feed Disallow: /*/comments Disallow: /? Disallow: /*? Disallow: /page/
Technical SEO | | ENSO
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /wp-includes/ Disallow: /wp-content/plugins/ Disallow: /wp-content/themes/ Disallow: /trackback Disallow: /commentsDisallow: /feed0 -
Robots.txt for subdomain
Hi there Mozzers! I have a subdomain with duplicate content and I'd like to remove these pages from the mighty Google index. The problem is: the website is build in Drupal and this subdomain does not have it's own robots.txt. So I want to ask you how to disallow and noindex this subdomain. Is it possible to add this to the root robots.txt: User-agent: *
Technical SEO | | Partouter
Disallow: /subdomain.root.nl/ User-agent: Googlebot
Noindex: /subdomain.root.nl/ Thank you in advance! Partouter0 -
Quick robots.txt check
We're working on an SEO update for http://www.gear-zone.co.uk at the moment, and I was wondering if someone could take a quick look at the new robots file (http://gearzone.affinitynewmedia.com/robots.txt) to make sure we haven't missed anything? Thanks
Technical SEO | | neooptic0 -
Site not being Indexed that fast anymore, Is something wrong with this Robots.txt
My wordpress site's robots.txt used to be this: User-agent: * Disallow: Sitemap: http://www.domainame.com/sitemap.xml.gz I also have all in one SEO installed and other than posts, tags are also index,follow on my site. My new posts used to appear on google in seconds after publishing. I changed the robots.txt to following and now post indexing takes hours. Is there something wrong with this robots.txt? User-agent: * Disallow: /cgi-bin Disallow: /wp-admin Disallow: /wp-includes Disallow: /wp-content/plugins Disallow: /wp-content/cache Disallow: /wp-content/themes Disallow: /wp-login.php Disallow: /wp-login.php Disallow: /trackback Disallow: /feed Disallow: /comments Disallow: /author Disallow: /category Disallow: */trackback Disallow: */feed Disallow: */comments Disallow: /login/ Disallow: /wget/ Disallow: /httpd/ Disallow: /*.php$ Disallow: /? Disallow: /*.js$ Disallow: /*.inc$ Disallow: /*.css$ Disallow: /*.gz$ Disallow: /*.wmv$ Disallow: /*.cgi$ Disallow: /*.xhtml$ Disallow: /? Disallow: /*?Allow: /wp-content/uploads User-agent: TechnoratiBot/8.1 Disallow: ia_archiverUser-agent: ia_archiver Disallow: / disable duggmirror User-agent: duggmirror Disallow: / allow google image bot to search all imagesUser-agent: Googlebot-Image Disallow: /wp-includes/ Allow: /* # allow adsense bot on entire siteUser-agent: Mediapartners-Google* Disallow: Allow: /* Sitemap: http://www.domainname.com/sitemap.xml.gz
Technical SEO | | ideas1230 -
Robots.txt
My campaign hse24 (www.hse24.de) is not being crawled any more ... Do you think this can be a problem of the robots.txt? I always thought that Google and friends are interpretating the file correct, seen that he site was crawled since last week. Thanks a lot Bernd NB: Here is the robots.txt: User-Agent: * Disallow: / User-agent: Googlebot User-agent: Googlebot-Image User-agent: Googlebot-Mobile User-agent: MSNBot User-agent: Slurp User-agent: yahoo-mmcrawler User-agent: psbot Disallow: /is-bin/ Allow: /is-bin/INTERSHOP.enfinity/WFS/HSE24-DE-Site/de_DE/-/EUR/hse24_Storefront-Start Allow: /is-bin/INTERSHOP.enfinity/WFS/HSE24-AT-Site/de_DE/-/EUR/hse24_Storefront-Start Allow: /is-bin/INTERSHOP.enfinity/WFS/HSE24-CH-Site/de_DE/-/CHF/hse24_Storefront-Start Allow: /is-bin/INTERSHOP.enfinity/WFS/HSE24-DE-Site/de_DE/-/EUR/hse24_DisplayProductInformation-Start Allow: /is-bin/INTERSHOP.enfinity/WFS/HSE24-AT-Site/de_DE/-/EUR/hse24_DisplayProductInformation-Start Allow: /is-bin/INTERSHOP.enfinity/WFS/HSE24-CH-Site/de_DE/-/CHF/hse24_DisplayProductInformation-Start Allow: /is-bin/intershop.static/WFS/HSE24-Site/-/Editions/ Allow: /is-bin/intershop.static/WFS/HSE24-Site/-/Editions/Root%20Edition/units/HSE24/Beratung/
Technical SEO | | remino630