Allow or Disallow First in Robots.txt
-
If I want to override a Disallow directive in robots.txt with an Allow directive, should the Allow come before or after the Disallow?
example:
Allow: /models/ford/*/*/page*
Disallow: /models/*/*/*/page*
-
Just caught this a bit late, and it's probably too late to add something, but my two pence is: test it in Webmaster Tools, via Crawl -> robots.txt Tester. If you've not used this before, simply add the URL you want to test and Google highlights the directive that allows or disallows it.
-
Thank you Cyrus, yes, I have tried your suggested robots.txt checker and, although it validates the file, it shows me a couple of warnings about the "unusual" use of wildcards. It is my understanding that I would probably need to discuss all this with Google folks directly.
Thank you for your answer... and, yes Keri, I know this is an old thread, but it's still useful today!
Thanks
-
Can't say with 100% confidence, but it sounds like it might work. You could always upload it to a server and use a robots.txt checker to validate, although validator tools sometimes handle edge cases like this slightly differently, which can make their results less reliable.
-
Just a quick note, this question is actually from spring of 2012.
-
What about something like:
allow: /directory/$
disallow: /directory/*
Where I want this to be indexed:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
Ideas?
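One rough way to sanity-check this locally, assuming the crawler treats * and $ the way Google's documentation describes, is a small Python sketch along these lines (an approximation for illustration, not any official matcher):

import re

def robots_pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the path.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    return re.compile("^" + re.escape(body).replace(r"\*", ".*") + ("$" if anchored else ""))

allow_rule = robots_pattern_to_regex("/directory/$")
disallow_rule = robots_pattern_to_regex("/directory/*")

for path in ["/directory/", "/directory/sub-directory/"]:
    print(path,
          "| Allow matches:", bool(allow_rule.match(path)),
          "| Disallow matches:", bool(disallow_rule.match(path)))

# Expected, under those assumptions:
#   /directory/               -> both rules match, so the engine's precedence rules decide (see below)
#   /directory/sub-directory/ -> only the Disallow matches, so it should be blocked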
-
I really appreciate all the effort you put in to ensure your method was correct. Many thanks.
-
Interesting question - I've had this discussion a couple of times with different SEOs. Here's my best understanding: There are actually 2 different answers - one if you are talking about Google, and one for every other search engine.
For most search engines, the "Allow" should come first. This is because the first matching pattern always wins, for the reasons Geoff stated.
But Google is different. They state:
"At a group-member level, in particular for
allow
anddisallow
directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined."Robots.txt Specifications - Webmasters — Google Developers
So for Google, order is not important, only the specificity of the rule based on the length of the entry. But the order of precedence for rules with wildcards is undefined.
This last part is important, because your directives contain wildcards. If I'm reading this right, your particular directives:
Allow: /models/ford/*/*/page*
Disallow: /models/*/*/*/page*
So if the order of precedence is "undefined", which directive will Google follow? Fortunately, there's a simple way to find out: Google Webmaster Tools allows you to test any robots.txt file. I created a dummy file based on your rules, and in this case your directives worked perfectly no matter what order I put them in.
http://cyrusshepard.com/models/ford/test/test/pages: Allowed by line 2 (Allow: /models/ford/*/*/page*)
http://cyrusshepard.com/models/chevy/test/test/pages: Blocked by line 3 (Disallow: /models/*/*/*/page*)
So, to summarize:
1. Always put Allow directives first, as most search engines follow the "first rule counts" convention.
2. Google doesn't care about order, but rather the specificity of the rule based on the length of the entry.
3. The order of precedence for rules with wildcards is undefined.
4. When in doubt, check your robots.txt file in Google Webmaster Tools.
Hope this helps. (Sorry for the very long answer, which basically says you were right all along.)
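To make the "most specific (longest) rule wins" behaviour concrete, here's a rough Python sketch of that precedence logic, using the rules from the question; it's only an approximation of what the documentation describes, not Google's actual matcher:

import re

def to_regex(pattern):
    # Approximate Google-style matching: '*' is a wildcard, a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    return re.compile("^" + re.escape(body).replace(r"\*", ".*") + ("$" if anchored else ""))

rules = [
    ("allow", "/models/ford/*/*/page*"),
    ("disallow", "/models/*/*/*/page*"),
]

def is_allowed(path):
    # Collect every rule that matches, then let the longest (most specific) pattern win.
    # If nothing matches, crawling is allowed by default.
    matches = [(len(pattern), kind) for kind, pattern in rules if to_regex(pattern).match(path)]
    if not matches:
        return True
    _, winning_kind = max(matches)  # ties between Allow and Disallow are the "undefined" case
    return winning_kind == "allow"

for path in ["/models/ford/test/test/pages", "/models/chevy/test/test/pages"]:
    print(path, "->", "allowed" if is_allowed(path) else "blocked")

# Expected: the ford URL is allowed (both rules match, but the longer Allow wins),
# and the chevy URL is blocked (only the Disallow matches).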
-
I understand your concern. I am basing my answer on the fact that if you don't have a robots.txt at all, Google will still crawl you, which means it's allow by default. So all that matters, in my opinion, is the Disallow; but because you need an Allow to carve out an exception from the wildcard Disallow, you could put the Allow first and the Disallow next.
Honestly, I don't think it matters. If you think about the way a bot works, it's not as if it reads one line of robots.txt, goes off crawling, then comes back and reads the next line, and so on. Does that make sense? It reads all the lines in the robots.txt and then follows the directives. But to be sure, you can try either of the scenarios and see for yourself. I am sure the results would be the same either way.
-
The allow directives need to come before the disallow directives for the same directory/file paths. (I have never personally tested this although it makes logical sense to instruct a robot to access one particular path within a directory structure before it sees that it is blocked from crawling that directory).
For example:
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
As per how Google have formatted their robots.txt.
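And for crawlers that, as described above, simply act on the first matching rule in file order, the evaluation would look roughly like this minimal sketch (an illustration under that assumption, using simple prefix matching), which is why the Allow lines need to come first:

def first_match_allows(path, rules):
    # Walk the rules in file order and act on the first one whose prefix matches the path.
    # If nothing matches, crawling is allowed by default.
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True

# The rules from the example above, in the order listed:
rules = [
    ("allow", "/profiles"),
    ("disallow", "/s2/profiles/me"),
    ("allow", "/s2/profiles"),
    ("allow", "/s2/photos"),
    ("allow", "/s2/static"),
    ("disallow", "/s2"),
]

print(first_match_allows("/s2/photos/abc", rules))     # True: Allow: /s2/photos is hit before Disallow: /s2
print(first_match_allows("/s2/profiles/me/x", rules))  # False: Disallow: /s2/profiles/me is hit first
print(first_match_allows("/s2/other", rules))          # False: only Disallow: /s2 matches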
-
Thanks. I want to make sure I get this right, in a syntax universally understood by all engines. I have seen webmasters all over the place on this one, with some saying that crawlers use a first-matching-rule approach and others saying they use a last-matching-rule approach. I am almost thinking of including the Allow directive twice, before and after, to cover all bases.
-
I don't think it matters, but I think I would disallow first, because by default everything is an Allow.