Robots.txt
-
Hello,
My client has a robots.txt file which says this:
User-agent: *
Crawl-delay: 2
I put it through a robots checker which said that it must have a **disallow command**. So should it say this:
User-agent: *
Disallow:
crawl-delay: 2
What effect (if any) would not having a disallow command have?
Thanks
-
Oops, good catch Paul, you're correct!
-
Michael - you are _incorrect_, I'm afraid! You need to read up on the specifics of the robots exclusion protocol.
A blank Disallow directive absolutely does NOT match all URLs on the site. In order to match all URLs on the site, the configuration would have to be:
User-agent: *
Disallow: /
Note the slash denoting the root of the site. If the field after disallow: is blank, that specifically means no URLs should be blocked. To quote www.robotstxt.org:
Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
The second part of that statement is equally important. For a record to be valid, it must include at least one user agent declaration and at least one disallow statement. If you want the file to not block any URLs, you must include the disallow: statement, but leave its value empty.
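As a rough illustration of that rule, here is a small Python sketch of the kind of check a robots.txt validator might run. It is not the checker Caroline used: the function name records_missing_disallow is made up for this example, and it simplifies things by treating each run of User-agent lines as the start of a record rather than relying on blank-line separators.

```python
def records_missing_disallow(robots_txt: str) -> list[str]:
    """Return the user agents of records that have no Disallow line (hypothetical helper)."""
    records = []       # each record: {"agents": [...], "has_disallow": bool}
    current = None     # the record currently being read
    seen_rule = False  # True once the current record has a non-User-agent field

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        field, sep, value = line.partition(":")
        if not sep:
            continue                           # blank or malformed line
        field = field.strip().lower()          # field names are case-insensitive
        if field == "user-agent":
            # A User-agent line after other fields starts a new record.
            if current is None or seen_rule:
                current = {"agents": [], "has_disallow": False}
                records.append(current)
                seen_rule = False
            current["agents"].append(value.strip())
        elif current is not None:
            seen_rule = True
            if field == "disallow":            # a blank value still counts
                current["has_disallow"] = True

    return [a for r in records if not r["has_disallow"] for a in r["agents"]]


original = "User-agent: *\nCrawl-delay: 2\n"
corrected = "User-agent: *\nDisallow:\nCrawl-delay: 2\n"

print(records_missing_disallow(original))   # ['*']
print(records_missing_disallow(corrected))  # []
```

The original file is flagged because its only record has no Disallow line; adding the blank Disallow: clears the warning without blocking anything.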
For more proof of this, here's the exact example, also from robotstxt.org:
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
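If you want to see the difference for yourself, here is a minimal sketch using Python's standard-library urllib.robotparser. It shows how one parser reads the two files; it is not a guarantee of how any particular search engine behaves. The helper name allowed, the agent name ExampleBot and the path are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


ALLOW_ALL = "User-agent: *\nDisallow:\n"    # blank value: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # slash: the whole site is blocked

print(allowed(ALLOW_ALL, "/any/page.html"))  # True
print(allowed(BLOCK_ALL, "/any/page.html"))  # False
```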
The main reason for including a robots.txt which doesn't block anything is to help clean up a server's error logs. With no robots.txt in place, an error will be inserted into the logs every time a crawler visits and can't find the file, bloating the logs and obscuring the real errors that might be present. A blank file may lead someone to believe that the robots.txt just hasn't been configured, leading to unnecessary confusion. So a file configured as above is preferable even if no blocking is desired.
Hope that clears things up?
Paul
[edited to replace line breaks in the code examples that were stripped out by Moz text editor]
-
Caroline,
REMOVE THE DISALLOW LINE.
I am concerned that that line will match all URLs on the site, and disallow the ENTIRE site.
Michael.
-
Thanks to both of you. I will recommend that the Robots.txt is changed to:
User-agent: *
Disallow:
in order to configure it right and miss out the crawl delay.
Caroline
-
Your second version is correct, ALBA123 - the robots protocol does require you to include a disallow statement in order to be correctly configured, even if it's blank to indicate crawling the full site.
I really question the wisdom of having a crawl delay in place though. What's the reason for doing so? I never want anything to get in the way of the search crawlers "doing their thing" as effectively as possible.
It's also rather strange to resort to a crawl delay without first blocking the crawling of any of the non-essential sections of the site. Usually a crawl delay is put in place to reduce the resources crawlers use (it's vastly better to improve the efficiency of the site or get stronger hosting), but delaying the crawl for the whole site instead of saving resources by blocking the non-essential areas first is pretty heavy-handed.
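To make the comparison concrete, here is a rough sketch with Python's standard-library urllib.robotparser. The /search/ and /cart/ paths and the ExampleBot name are made up for illustration (I don't know which sections of this site are non-essential), and this only shows how Python's parser reads the files, not what any given crawler will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: block two made-up non-essential sections, no crawl delay.
TARGETED = """\
User-agent: *
Disallow: /search/
Disallow: /cart/
"""

# Nothing blocked, but a delay requested for every fetch on the site.
DELAY_ONLY = """\
User-agent: *
Disallow:
Crawl-delay: 2
"""


def load(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


targeted, delay_only = load(TARGETED), load(DELAY_ONLY)

print(targeted.can_fetch("ExampleBot", "/search/?q=shoes"))  # False
print(targeted.can_fetch("ExampleBot", "/products/shoes"))   # True
print(targeted.crawl_delay("ExampleBot"))                    # None
print(delay_only.crawl_delay("ExampleBot"))                  # 2
```

The first file keeps crawlers out of the sections that waste resources and leaves everything else at full speed; the second slows down the crawl of every page, including the ones you most want crawled.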
Does that make sense?
Paul
-
I'd be really, REALLY careful about a disallow statement like that: you run the risk of disallowing your entire website.
FYI I'm not sure putting a crawl delay in your robots.txt file is the right answer. I saw an example a week or so ago where Google (I think, but maybe it was Bing) explicitly said somewhere that it had ignored the crawl delay in the robots.txt. I would specify the crawl delay in Webmaster Tools instead. It's hard to find, but it's there:
- in Webmaster Tools, select the site you want to set the crawl rate for
- click the Gear icon in the upper right
- you'll see the option there to set the crawl rate