Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Googlebot does not obey robots.txt disallow
-
Hi Mozzers!
We are trying to get Googlebot to steer away from our internal search results pages by adding a parameter "nocrawl=1" to facet/filter links and then robots.txt disallow all URLs containing that parameter.
We implemented this late august and since that, the GWMT message "Googlebot found an extremely high number of URLs on your site", stopped coming.
But today we received yet another. The weird thing is that Google gives many of our nowadays robots.txt disallowed URLs as examples of URLs that may cause us problems.
What could be the reason?
Best regards,
Martin
-
Sorry for the late reply. Feel free to send me a PM. (not sure I can help, but more than happy to take a look)
-
We do not currently have any sanitation rules in order to maintain the nocrawl param. But that is a good point. 301:ing will be difficult for us but I will definitely add the nocrawl param to the rel canonical of those internal SERPs.
-
Thank you, Igol. I will definitely look into your first suggestion.
-
Thank you, Cyrus.
This is what it looks like:
User-agent: *
Disallow: /nocrawl=1The weird thing is that when testing one of the sample URLs (given by Google as "problematic" in the GWMT message and that contains the nocrawl param) on the GWMT "Blocked URLs" page by entering the contents of our robots.txt and the sample URL, Google says crawling of the URL is disallowed for Googlebot.
On the top of the same page, it says "Never" under the heading "Fetched when" (translated from Swedish..). But when i "Fetch as Google" our robots.txt, Googlebot has no problems fetching it. So i guess the "Never" information is due to a GWMT bug?
I also tested our robots.txt against your recommended service http://www.frobee.com/robots-txt-check. It says all robots has access to the sample URL above, but I gather the tool is not wildcard-savvy.
I will not disclose our domain in this context, please tell me if it is ok to send you a PW.
About the noindex stuff. Basically, the nocrawl param is added to internal links pointing to internal search result pages filtered by more than two params. Although we allow crawling of less complicated internal serps, we disallow indexing of most of them by "meta noindex".
-
Thanks.
100% agree with the Meta Noindex suggestion.
-
It can be tricky blocking parameters with robots.txt. The first thing you want to do is make sure your are actually blocking the URLs. There are a few good robots.txt checkers out there that can help:
You're file is probably going to look something like:
User-agent: *
Disallow: /*?nocrawl=1... but this could vary depending on exactly you don't want crawled
+1 to Igal's suggestion of handling these via parameter settings in Google Webmaster Tools: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
Finally, if your goal is to keep search results out of the index (it probably should be) then you should also highly consider using a meta robots NOINDEX tag on all search results pages. You can also slap a nofollow on links pointing to search results as this might also help Google steer clear of those pages.
Best of luck!
Edit: Here's what John Wu of Google Webmaster has to say...
"We show this warning when we find a high number of URLs on a site -- even before we attempt to crawl them. If you are blocking them with a robots.txt file, that's generally fine. If you really do have a high number of URLs on your site, you can generally ignore this message. If your site is otherwise small and we find a high number of URLs, then this kind of message can help you to fix any issues (or disallow access) before we start to access your server to check gazillions of URLs :-)."
-
Didn't say it wasn't.
I`m just not sure how these rules apply to parameters, since they are not a part of the "core" URL.
(For example: What happens if I take a URL from your site, change a nocrawl=1 to nocrawl=0 and link to it from mine?
Do you have any URL sanitation rules in place to overcome that or will the page be indexed by Googlebot when it crawls my site and moves on to yours?)Personally, when dealing with parameters, I find it easier to work with WMT so I was offering an easier workaround, (at least for me)
To tell you the truth, I would use hard-coded on page meta noindex/nofollow here (again, as parameters can be so easily manipulated).
-
Igal, thank your for replying.
But robots.txt disallowing URLs by matching patterns has been supported by Googlebot for a long time now.
-
Hi
I`m not sure if this is the best way to go about it.
Robots.txt is commonly used for folder level disallow rules, I`m not sure how it will respond to parameters.
Having said that, there are several things you can do here:
1. You can use WMT to zero in on this parameter and prevent it from being searched.
To do so choose Configuration>>URL Parameters, answer "Yes" to the question about content change and
check-in the 3rd bullet (Only URL with value...) Of course you'll need to choose "1" as the right value.2. If this still didn't solve your issue, you might want to try using htacess + regex to prevent access by user agent.
You can find user-agent information here Googlebot user agent listAlso, you may want to check my blog post about some of the less known Googlebot Facts (shameless self-promotion)
Best
Igal
-
I'll send you a PW, Des.
-
What the domain.?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Googlebot and other spiders are searching for odd links in our website trying to understand why, and what to do about it.
I recently began work on an existing Wordpress website that was revamped about 3 months ago. https://thedoctorwithin.com. I'm a bit new to Wordpress, so I thought I should reach out to some of the experts in the community.Checking ‘Not found’ Crawl Errors in Google Search Console, I notice many irrelevant links that are not present in the website, nor the database, as near as I can tell. When checking the source of these irrelevant links, I notice they’re all generated from various pages in the site, as well as non-existing pages, allegedly in the site, even though these pages have never existed. For instance: https://thedoctorwithin.com/category/seminars/newsletters/page/7/newsletters/page/3/feedback-and-testimonials/ allegedly linked from: https://thedoctorwithin.com/category/seminars/newsletters/page/7/newsletters/page/3/ (doesn’t exist) In other cases, these goofy URLs are even linked from the sitemap. BTW - all the URLs in the sitemap are valid URLs. Currently, the site has a flat structure. Nearly all the content is merely URL/content/ without further breakdown (or subdirectories). Previous site versions had a more varied page organization, but what I'm seeing doesn't seem to reflect the current page organization, nor the previous page organization. Had a similar issue, due to use of Divi's search feature. Ended up with some pretty deep non-existent links branching off of /search/, such as: https://thedoctorwithin.com/search/newsletters/page/2/feedback-and-testimonials/feedback-and-testimonials/online-continuing-education/consultations/ allegedly linked from: https://thedoctorwithin.com/search/newsletters/page/2/feedback-and-testimonials/feedback-and-testimonials/online-continuing-education/ (doesn't exist). I blocked the /search/ branches via robots.txt. No real loss, since neither /search/ nor any of its subdirectories are valid. There are numerous pre-existing categories and tags on the site. The categories and tags aren't used as pages. I suspect Google, (and other engines,) might be creating arbitrary paths from these. Looking through the site’s 404 errors, I’m seeing the same behavior from Bing, Moz and other spiders, as well. I suppose I could use Search Console to remove URL/category/ and URL/tag/. I suppose I could do the same, in regards to other legitimate spiders / search engines. Perhaps it would be better to use Mod Rewrite to lead spiders to pages that actually do exist. Looking forward to suggestions about best way to deal with these errant searches. Also curious to learn about why these are occurring. Thank you.
Technical SEO | | linkjuiced0 -
Google Search console says 'sitemap is blocked by robots?
Google Search console is telling me "Sitemap contains URLs which are blocked by robots.txt." I don't understand why my sitemap is being blocked? My robots.txt look like this: User-Agent: *
Technical SEO | | Extima-Christian
Disallow: Sitemap: http://www.website.com/sitemap_index.xml It's a WordPress site, with Yoast SEO installed. Is anyone else having this issue with Google Search console? Does anyone know how I can fix this issue?1 -
Robots.txt & meta noindex--site still shows up on Google Search
I have set up my robots.txt like this: User-agent: *
Technical SEO | | RoxBrock
Disallow: / and I have this meta tag in my on a Wordpress site, set up with SEO Yoast name="robots" content="noindex,follow"/> I did "Fetch as Google" on my Google Search Console My website is still showing up in the search results and it says this: "A description for this result is not available because of this site's robots.txt" This site has not shown up for years and now it is ranking above my site that I want to rank for this keyword. How do I get Google to ignore this site? This seems really weird and I'm confused how a site with little content, that has not been updated for years can rank higher than a site that is constantly updated and improved.1 -
Some URLs were not accessible to Googlebot due to an HTTP status error.
Hello I'm a seo newbie and some help from the community here would be greatly appreciated. I have submitted the sitemap of my website in google webmasters tools and now I got this warning: "When we tested a sample of the URLs from your Sitemap, we found that some URLs were not accessible to Googlebot due to an HTTP status error. All accessible URLs will still be submitted." How do I fix this? What should I do? Many thanks in advance.
Technical SEO | | GoldenRanking140 -
Adding multi-language sitemaps to robots.txt
I am working on a revamped multi-language site that has moved to Magento. Each language runs off the core coding so there are no sub-directories per language. The developer has created sitemaps which have been uploaded to their respective GWT accounts. They have placed the sitemaps in new directories such as: /sitemap/uk/sitemap.xml /sitemap/de/sitemap.xml I want to add the sitemaps to the robots.txt but can't figure out how to do it. Also should they have placed the sitemaps in a single location with the file identifying each language: /sitemap/uk-sitemap.xml /sitemap/de-sitemap.xml What is the cleanest way of handling these sitemaps and can/should I get them on robots.txt?
Technical SEO | | MickEdwards0 -
Empty Meta Robots Directive - Harmful?
Hi, We had a coding update and a side-effect of that was that our directive was emptied, in other words it now reads as: on all of the site. I've since noticed that Google's cache date on all of the pages - at least, the ones I tested - have a Cached date of no later than 17 December '12 - that's the Monday after the directive was removed on mass. So, A, does anyone have solid evidence of an empty directive causing problems? Past experience, Matt Cutts, Fishkin quote, etc. And then B - It seems fairly well correlated but, does my entire site's homogenous Cached date point to this tag removal? Or is it fairly normal to have a particular cache date across a large site (we're a large ecommerce site). Our site: http://www.zando.co.za/ I'm having the directive reinstated as soon as Dev permitting. And then, for extra credit, is there a way with Google's API, or perhaps some other tool, to run an arbitrary list and retrieve Cached dates? I'd want to do this for diagnosis purposes and preferably in a way that OK with Google. I'd avoid CURLing for the cached URL and scraping out that dates with BASH, or any such kind of thing. Cheers,
Technical SEO | | RocketZando0 -
I accidentally blocked Google with Robots.txt. What next?
Last week I uploaded my site and forgot to remove the robots.txt file with this text: User-agent: * Disallow: / I dropped from page 11 on my main keywords to past page 50. I caught it 2-3 days later and have now fixed it. I re-imported my site map with Webmaster Tools and I also did a Fetch as Google through Webmaster Tools. I tweeted out my URL to hopefully get Google to crawl it faster too. Webmaster Tools no longer says that the site is experiencing outages, but when I look at my blocked URLs it still says 249 are blocked. That's actually gone up since I made the fix. In the Google search results, it still no longer has my page title and the description still says "A description for this result is not available because of this site's robots.txt – learn more." How will this affect me long-term? When will I recover my rankings? Is there anything else I can do? Thanks for your input! www.decalsforthewall.com
Technical SEO | | Webmaster1230 -
OK to block /js/ folder using robots.txt?
I know Matt Cutts suggestions we allow bots to crawl css and javascript folders (http://www.youtube.com/watch?v=PNEipHjsEPU) But what if you have lots and lots of JS and you dont want to waste precious crawl resources? Also, as we update and improve the javascript on our site, we iterate the version number ?v=1.1... 1.2... 1.3... etc. And the legacy versions show up in Google Webmaster Tools as 404s. For example: http://www.discoverafrica.com/js/global_functions.js?v=1.1
Technical SEO | | AndreVanKets
http://www.discoverafrica.com/js/jquery.cookie.js?v=1.1
http://www.discoverafrica.com/js/global.js?v=1.2
http://www.discoverafrica.com/js/jquery.validate.min.js?v=1.1
http://www.discoverafrica.com/js/json2.js?v=1.1 Wouldn't it just be easier to prevent Googlebot from crawling the js folder altogether? Isn't that what robots.txt was made for? Just to be clear - we are NOT doing any sneaky redirects or other dodgy javascript hacks. We're just trying to power our content and UX elegantly with javascript. What do you guys say: Obey Matt? Or run the javascript gauntlet?0