Robots.txt
-
Hello,
My client has a robots.txt file which says this:
User-agent: *
Crawl-delay: 2

I put it through a robots checker, which said that it must have a **disallow command**. So should it say this:
User-agent: *
Disallow:
Crawl-delay: 2
What effect (if any) would not having a disallow command have?
Thanks
-
Oops, good catch Paul, you're correct!
-
Michael - you are _incorrect_, I'm afraid! You need to read up on the specifics of the robots exclusion protocol.
A blank Disallow directive absolutely does NOT match all URLs on the site. In order to match all URLs on the site, the configuration would have to be:
User-agent: *
Disallow: /
Note the slash denoting the root of the site. If the value after Disallow: is blank, that specifically means no URLs should be blocked. To quote www.robotstxt.org:
Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
The second part of that statement is equally important. For a record to be valid, it must include at least one user agent declaration and at least one disallow statement. If you want the file to not block any URLs, you must include the disallow: statement, but leave its value empty.
For more proof of this, here's the exact example, also from robotstxt.org:
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
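If you want to verify this behaviour for yourself, Python's standard-library parser (urllib.robotparser) implements the protocol, so a minimal sketch like the following demonstrates the difference (the example.com URL is just a placeholder):

from urllib import robotparser

# An empty Disallow value: nothing on the site is blocked
rp_open = robotparser.RobotFileParser()
rp_open.parse([
    "User-agent: *",
    "Disallow:",
])
print(rp_open.can_fetch("*", "http://example.com/any/page"))  # True

# "Disallow: /" blocks the entire site
rp_blocked = robotparser.RobotFileParser()
rp_blocked.parse([
    "User-agent: *",
    "Disallow: /",
])
print(rp_blocked.can_fetch("*", "http://example.com/any/page"))  # False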
The main reason for including a robots.txt which doesn't block anything is to help clean up a server's error logs. With no robots.txt in place, an error will be inserted into the logs every time a crawler visits and can't find the file, bloating the logs and obscuring the real errors that might be present. A blank file may lead someone to believe that the robots.txt just hasn't been configured, leading to unnecessary confusion. So a file configured as above is preferable even if no blocking is desired.
Hope that clears things up?
Paul
[edited to replace line breaks in the code examples that were stripped out by Moz text editor]
-
Caroline,
REMOVE THE DISALLOW LINE.
I am concerned that that line will match all URLs on the site, and disallow the ENTIRE site.
Michael.
-
Thanks to both of you. I will recommend that the robots.txt is changed to:
User-agent: *
Disallow:

in order to configure it correctly and leave out the crawl delay.
Caroline
-
Your second version is correct, ALBA123 - the robots exclusion protocol does require you to include a Disallow statement for the record to be valid, even if its value is left blank to allow crawling of the full site.
I really question the wisdom of having a crawl delay in place though. What's the reason for doing so? I never want anything to get in the way of the search crawlers "doing their thing" as effectively as possible.
It's also rather strange to add a crawl delay while not blocking crawling of any of the non-essential sections of the site. Usually a crawl delay is in place to reduce the resources crawlers consume (it's vastly better to improve the site's efficiency or get stronger hosting), but slowing the crawl of the whole site, instead of first saving resources by blocking the non-essential areas, is pretty heavy-handed.
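If the crawl delay genuinely is needed, a configuration along these lines would make more sense, blocking the low-value areas as well (the paths here are purely hypothetical - substitute whatever non-essential sections the site actually has):

User-agent: *
Disallow: /search/
Disallow: /print/
Crawl-delay: 2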
Does that make sense?
Paul
-
I'd be really, REALLY careful about a disallow statement like that: you run the risk of disallowing your entire website.
FYI, I'm not sure putting a crawl delay in your robots.txt file is the right answer. I saw an example a week or so ago where Google (I think, but maybe it was Bing) explicitly said it had ignored the crawl delay in the robots.txt. I would specify the crawl delay in Webmaster Tools instead. It's hard to find, but it's there:
- in Webmaster Tools, select the site you want to set the crawl rate for
- click the Gear icon in the upper right
- you'll see the option there to set the crawl rate
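As an aside, if you want to confirm what a crawler that does honour Crawl-delay would read from the file, Python's standard urllib.robotparser (3.6+) exposes the value - a minimal sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Crawl-delay: 2",
])
print(rp.crawl_delay("*"))  # prints 2 - the value parses fine; Google just ignores it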