Allow or Disallow First in Robots.txt
-
If I want to override a Disallow directive in robots.txt with an Allow directive, do I put the Allow directive before or after the Disallow directive?
Example:
Allow: /models/ford///page*
Disallow: /models////page
-
Just caught this a bit late, and it's probably too late to add something, but my two pence is: test it in Webmaster Tools, via Crawl -> robots.txt Tester. If you've not used this before, simply add the URL you want to test and Google highlights the directive that allows or disallows it.
-
Thank you Cyrus. Yes, I have tried your suggested robots.txt checker, and although it validates the file, it shows me a couple of warnings about the "unusual" use of wildcards. My understanding is that I would probably need to discuss all this with the Google folks directly.
Thank you for your answer... and yes, Keri, I know this is an old thread, but it is still useful today!
Thanks
-
Can't say with 100% confidence, but it sounds like it might work. You could always upload it to a server and use a robots.txt checker to validate it, although validator tools sometimes handle edge cases like this slightly differently, which can make their results unreliable.
-
Just a quick note, this question is actually from spring of 2012.
-
What about something like:
allow: /directory/$
disallow: /directory/*
Where I want this to be indexed:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
Ideas?
-
I really appreciate all the effort you put in to ensure your method was correct. Many thanks.
-
Interesting question - I've had this discussion a couple of times with different SEOs. Here's my best understanding: There are actually 2 different answers - one if you are talking about Google, and one for every other search engine.
For most search engines, the "Allow" should come first. This is because the first matching pattern always wins, for the reasons Geoff stated.
But Google is different. They state:
"At a group-member level, in particular for
allow
anddisallow
directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined."Robots.txt Specifications - Webmasters — Google Developers
So for Google, order is not important, only the specificity of the rule based on the length of the entry. But the order of precedence for rules with wildcards is undefined.
This last part is important, because your directives contain wildcards. If I'm reading this right, your particular directives are:
Allow: /models/ford///page*
Disallow: /models////page
So if it's "undefined", which directive will Google follow if order isn't important? Fortunately, there's a simple way to find out: Google Webmaster Tools lets you test any robots.txt file. I created a dummy file based on your rules, and in this case your directives worked perfectly no matter what order I put them in:
| http://cyrusshepard.com/models/ford/test/test/pages | Allowed by line 2: Allow: /models/ford///page* |
| http://cyrusshepard.com/models/chevy/test/test/pages | Blocked by line 3: Disallow: /models////page |
So, to summarize:
1. Always put Allow directives first, as most search engines follow the "first matching rule wins" approach.
2. Google doesn't care about order, but rather the specificity of the rule based on the length of the path entry.
3. The order of precedence for rules with wildcards is undefined.
4. When in doubt, check your robots.txt file in Google Webmaster Tools.
Hope this helps. (Sorry for the very long answer, which basically says you were right all along.)
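To make that "most specific rule wins" idea a bit more concrete, here is a rough Python sketch of the behaviour Google describes. It is only a simplified model, not Google's actual parser, and the two rules in it are hypothetical stand-ins rather than the exact directives above.

```python
import re

def pattern_to_regex(path):
    # Translate a robots.txt path pattern ("*" wildcard, optional "$" anchor)
    # into a regular expression. A simplified model of the documented behaviour.
    anchored = path.endswith("$")
    core = path[:-1] if anchored else path
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

def google_style_verdict(rules, url_path):
    # rules: list of (directive, path) tuples, e.g. ("allow", "/models/ford/*/page").
    # Per Google's spec, the matching rule with the longest path entry wins,
    # regardless of the order the rules appear in the file.
    matches = [r for r in rules if pattern_to_regex(r[1]).match(url_path)]
    if not matches:
        return "allow"  # nothing matches, so crawling is allowed by default
    return max(matches, key=lambda r: len(r[1]))[0]

# Hypothetical stand-in rules (not the exact directives from the question)
rules = [
    ("disallow", "/models/*/page"),
    ("allow", "/models/ford/*/page"),
]
print(google_style_verdict(rules, "/models/ford/test/pages"))   # -> allow
print(google_style_verdict(rules, "/models/chevy/test/pages"))  # -> disallow
```

Run against those stand-in rules, a ford URL comes back allowed and a chevy URL comes back disallowed no matter which order the rules are listed in, which matches what the tester showed.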
-
I understand your concern. I am basing my answer on the fact that if you don't have a robots.txt at all, Google will still crawl you, which means everything is allowed by default. So all that matters, in my opinion, is the Disallow; but because you need to carve an Allow out of the wildcard Disallow, you could put the Allow first and the Disallow next.
Honestly, I don't think it matters. If you think about the way a bot works, it's not as if it reads one line of robots.txt, goes off crawling, then comes back and reads the next line, and so on. It reads all the lines in the robots.txt and then follows the directives. But to be sure, you can try either order and see for yourself. I am sure the results would be the same either way.
-
The Allow directives need to come before the Disallow directives for the same directory/file paths. (I have never personally tested this, although it makes logical sense to tell a robot it can access one particular path within a directory structure before it sees that it is blocked from crawling that directory.)
For example:
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
This follows how Google has formatted its own robots.txt.
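If you want to sanity-check that ordering point without uploading anything, Python's built-in urllib.robotparser behaves like the simple first-match parsers being discussed here: it walks the rules top to bottom and applies the first one that matches. Note that it does not support the * and $ wildcard extensions, so this sketch only covers plain path prefixes, and the example.com URLs are just placeholders.

```python
import urllib.robotparser

# Rules laid out allow-first, in the same spirit as Google's own robots.txt
robots_txt = """\
User-agent: *
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# With the Allow rules listed before the catch-all Disallow, a first-match
# parser lets the carved-out sections through while blocking the rest of /s2.
print(parser.can_fetch("*", "https://www.example.com/s2/profiles/someone"))  # True
print(parser.can_fetch("*", "https://www.example.com/s2/photos/someone"))    # True
print(parser.can_fetch("*", "https://www.example.com/s2/admin/settings"))    # False
```

Move Disallow: /s2 above the Allow lines and the first two checks flip to False, which is exactly why allow-first ordering matters for parsers that stop at the first match (Google, as discussed above, uses the longest match instead, so order doesn't matter there).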
-
Thanks. I want to make sure I get this right, in a syntax universally understood by all engines. I have seen webmasters all over the place on this one, with some saying that crawlers use a first-matching-rule approach and others saying they use a last-matching-rule approach. I am almost thinking of including the Allow directive twice, before and after the Disallow, to cover all bases.
-
I don't think it matters, but I think I would disallow first, because by default everything is an Allow.