Robots.txt
-
Hello,
My client has a robots.txt file which says this:
User-agent: *
Crawl-delay: 2

I put it through a robots checker, which said that it must have a **disallow command**. So should it say this:
User-agent: *
Disallow:
Crawl-delay: 2

What effect (if any) would not having a disallow command have?
Thanks
-
Oops, good catch Paul, you're correct!
-
Michael - you are _incorrect_, I'm afraid! You need to read up on the specifics of the Robots Exclusion Protocol.
A blank Disallow directive absolutely does NOT match all URLs on the site. In order to match all URLs on the site, the configuration would have to be:
User-agent: *
Disallow: /
Note the slash denoting the root of the site. If the field after Disallow: is blank, that specifically means no URLs should be blocked. To quote www.robotstxt.org:
Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
The second part of that statement is equally important. For a record to be valid, it must include at least one user-agent declaration and at least one disallow statement. If you want the file to not block any URLs, you must include the Disallow: statement but leave its value empty.
For more proof of this, here's the exact example, also from robotstxt.org:
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
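If you want to verify this behaviour for yourself, Python's standard-library robots.txt parser shows the difference directly. Here's a quick sketch (the example.com URL is just a placeholder):

```python
from urllib import robotparser

def can_fetch(lines, url, agent="*"):
    """Parse robots.txt rules supplied as a list of lines, then test one URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(agent, url)

# A blank Disallow value blocks nothing:
print(can_fetch(["User-agent: *", "Disallow:"], "http://example.com/some-page"))
# -> True

# Disallow: / blocks the entire site:
print(can_fetch(["User-agent: *", "Disallow: /"], "http://example.com/some-page"))
# -> False
```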
The main reason for including a robots.txt which doesn't block anything is to help clean up a server's error logs. With no robots.txt in place, an error will be inserted into the logs every time a crawler visits and can't find the file, bloating the logs and obscuring the real errors that might be present. A blank file may lead someone to believe that the robots.txt just hasn't been configured, leading to unnecessary confusion. So a file configured as above is preferable even if no blocking is desired.
Hope that clears things up?
Paul
[edited to replace line breaks in the code examples that were stripped out by Moz text editor]
-
Caroline,
REMOVE THE DISALLOW LINE.
I am concerned that that line will match all URLs on the site, and disallow the ENTIRE site.
Michael.
-
Thanks to both of you. I will recommend that the robots.txt be changed to:
User-agent: *
Disallow:

in order to configure it correctly and leave out the crawl delay.
Caroline
-
Your second version is correct, ALBA123 - the robots protocol does require a disallow statement for the file to be correctly configured, even if it's left blank to indicate that the full site may be crawled.
I really question the wisdom of having a crawl delay in place though. What's the reason for doing so? I never want anything to get in the way of the search crawlers "doing their thing" as effectively as possible.
It's also rather strange to add a crawl delay while not blocking the crawling of any of the non-essential sections of the site. Usually a crawl delay is put in place to reduce the resources consumed by crawlers (it's vastly better to improve the efficiency of the site or move to stronger hosting), but delaying crawling of the whole site instead of first saving resources by blocking the non-essential areas is pretty heavy-handed.
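To illustrate, a more targeted configuration might look something like this (the paths below are purely hypothetical; the right list depends on which areas of the site are genuinely non-essential):

User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /admin/

That saves crawler resources where it matters, without slowing the crawl of the pages you actually want indexed.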
Does that make sense?
Paul
-
I'd be really, REALLY careful about a disallow statement like that: you run the risk of disallowing your entire website.
FYI, I'm not sure putting a crawl delay in your robots.txt file is the right answer. I saw an example a week or so ago where Google (I think, but maybe it was Bing) explicitly said somewhere that it had ignored the crawl delay in the robots.txt. I would specify the crawl delay in Webmaster Tools instead. It's hard to find, but it's there:
- In Webmaster Tools, select the site you want to set the crawl rate for
- Click the gear icon in the upper right
- You'll see the option there to set the crawl rate