Robots.txt
-
Hello,
My client has a robots.txt file which says this:
User-agent: *
Crawl-delay: 2
I put it through a robots checker, which said that it must have a **disallow command**. So should it say this:
User-agent: *
Disallow:
Crawl-delay: 2
What effect (if any) would not having a disallow command have?
Thanks
-
Oops, good catch, Paul - you're correct!
-
Michael - you are _incorrect_, I'm afraid! You need to read up on the specifics of the robots exclusion protocol.
A blank Disallow directive absolutely does NOT match all URLs on the site. In order to match all URLs on the site, the configuration would have to be:
User-agent: *
Disallow: /
Note the slash denoting the root of the site. If the field after disallow: is blank, that specifically means no URLs should be blocked. To quote www.robotstxt.org:
Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
The second part of that statement is equally important. For a record to be valid, it must include at least one user agent declaration and at least one disallow statement. If you want the file to not block any URLs, you must include the disallow: statement, but leave its value empty.
For more proof of this, here's the exact example, also from robotstxt.org:
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
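If you want to see this behavior for yourself, here's a quick sanity check you can run with Python's standard-library urllib.robotparser - just a sketch, and example.com stands in for your own site:

```python
import urllib.robotparser

# A record with an empty Disallow value: nothing is blocked.
allow_all = urllib.robotparser.RobotFileParser()
allow_all.parse([
    "User-agent: *",
    "Disallow:",
])
print(allow_all.can_fetch("*", "https://example.com/any/page"))  # True

# A record disallowing the root path: everything is blocked.
block_all = urllib.robotparser.RobotFileParser()
block_all.parse([
    "User-agent: *",
    "Disallow: /",
])
print(block_all.can_fetch("*", "https://example.com/any/page"))  # False
```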
The main reason for including a robots.txt which doesn't block anything is to help clean up a server's error logs. With no robots.txt in place, an error will be inserted into the logs every time a crawler visits and can't find the file, bloating the logs and obscuring the real errors that might be present. A blank file may lead someone to believe that the robots.txt just hasn't been configured, leading to unnecessary confusion. So a file configured as above is preferable even if no blocking is desired.
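For illustration, each of those failed requests leaves an entry along these lines in an Apache-style access log (the IP address, timestamp, and byte count here are invented):
66.249.66.1 - - [10/Oct/2014:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 404 196 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"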
Hope that clears things up?
Paul
[edited to replace line breaks in the code examples that were stripped out by Moz text editor]
-
Caroline,
REMOVE THE DISALLOW LINE.
I am concerned that that line will match all URLs on the site and disallow the ENTIRE site.
Michael.
-
Thanks to both of you. I will recommend that the Robots.txt is changed to:
User-agent: *
Disallow:
in order to configure it correctly and leave out the crawl delay.
Caroline
-
Your second version is correct, ALBA123 - the robots exclusion protocol does require a disallow statement for the file to be correctly configured, even if the value is left blank to indicate that the whole site may be crawled.
I really question the wisdom of having a crawl delay in place though. What's the reason for doing so? I never want anything to get in the way of the search crawlers "doing their thing" as effectively as possible.
It's also rather strange to impose a crawl delay while not blocking the crawling of any of the non-essential sections of the site. A crawl delay is usually put in place to reduce the resources consumed by crawlers (it's vastly better to improve the site's efficiency or get stronger hosting), but delaying the crawl of the whole site, instead of saving resources by blocking the non-essential areas first, is pretty heavy-handed.
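For example, a configuration along these lines (the paths are purely hypothetical - substitute whatever non-essential areas the site actually has) saves crawler resources directly, so the delay may not be needed at all:
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /print/
# Keep the delay only if blocking the heavy areas isn't enough
Crawl-delay: 2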
Does that make sense?
Paul
-
I'd be really, REALLY careful about a disallow statement like that: you run the risk of disallowing your entire website.
FYI, I'm not sure putting a crawl delay in your robots.txt file is the right answer. I saw an example a week or so ago where Google (I think, but maybe it was Bing) explicitly said somewhere that it had ignored the crawl delay in the robots.txt. I would specify the crawl delay in Webmaster Tools instead. It's hard to find, but it's there:
- in Webmaster Tools, select the site you want to set the crawl rate for
- click the Gear icon in the upper right
- you'll see the option there to set the crawl rate
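And if a Crawl-delay line does stay in the file, you can at least confirm what a standards-based parser reads from it - a quick sketch using Python's standard-library urllib.robotparser (Python 3.6+); note this says nothing about whether Google actually honors the directive:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Crawl-delay: 2",
])

# The delay the parser extracted for any user agent -- prints 2
print(rp.crawl_delay("*"))
```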