Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Allow or Disallow First in Robots.txt
-
If I want to override a Disallow directive in robots.txt with an Allow command, do I have the Allow command before or after the Disallow command?
example:
Allow: /models/ford///page*
Disallow: /models////page
-
Just caught this a bit late and probably to late to add something but my two pence is test it in Webmaster Tools, via Crawl -> Robot.txt tester - if you've not used this before simply add the url you want to test and Google highlights the directive that allows or disallows it.
-
Thank you Cyrus, yes, I have tried your suggested robots.txt checker and despite it validates the file, it shows me a couple of warnings about the "unusual" use of wildcard. It is my understanding that I would probably need to discuss all this with Google folks directly.
Thank you for you answer... and, yes Keri, I know this is a old thread, but still useful today!
Thanks
-
Can't say with 100% confidence, but sounds like it might work. You could always upload it to a server and use a robots.txt checker to validate, although sometimes the validator tools may incorporate slight differences in edge cases like this that make them moot.
-
Just a quick note, this question is actually from spring of 2012.
-
What about something like:
allow: /directory/$
disallow: /directory/*
Where I want this to be indexed:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
Ideas?
-
I really appreciate all that effort you put in to ensure your method was correct. many thanks.
-
Interesting question - I've had this discussion a couple of times with different SEOs. Here's my best understanding: There are actually 2 different answers - one if you are talking about Google, and one for every other search engine.
For most search engines, the "Allow" should come first. This is because the first matching pattern always wins, for the reasons Geoff stated.
But Google is different. They state:
"At a group-member level, in particular for
allow
anddisallow
directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined."Robots.txt Specifications - Webmasters — Google Developers
So for Google, order is not important, only the specificity of the rule based on the length of the entry. But the order of precedence for rules with wildcards is undefined.
This last part is important, because your directives contain wildcards. If I'm reading this right, your particular directives:
Allow: /models/ford///page*
Disallow: /models////pageSo if it's "undefined" which directive will Google follow, if order isn't important? Fortunately, there's a simple way to find out.Google Webmaster allows you to test any robots.txt file. I created a dummy file based on your rules, In this case, your directives worked perfectly no matter what order I put them in.
| http://cyrusshepard.com/models/ford/test/test/pages | Allowed by line 2: Allow: /models/ford///page* | Allowed by line 2: Allow: /models/ford///page* |
| http://cyrusshepard.com/models/chevy/test/test/pages | Blocked by line 3: Disallow: /models////page | Blocked by line 3: Disallow: /models////page |So, to summarize:1. Always put Allow directives first, as most search engines follow the "first rule counts" rule.2. Google doesn't care about order, but rather the specificity based on the length of the entry.3. The order of precedence for rules with wildcards is undefined.4. When in doubt, check your robots.txt file in Google Webmaster tools.Hope this helps.(sorry for the very long answer which basically says you were right all along
-
I understand your concern. I am basing my answer based on the fact that if you don't have a robots.txt at all, Google will still crawl you, which means its an allow by default. So all that matters in my opinion is the disallow, but because you need an allow from the wildcard disallow, you could allow that and disallow next.
Honestly, I don't think it matters. If you think the way a bot would work, it's not like robots.txt 1 line is read, then the bot goes crawling and then comes back reads the next line and so on. Does that make sense ? It reads all the lines in the robots.txt and then follows the directives. But to be sure, you can do either of the scenarios and see for yourself. I am sure the results would be same either way.
-
The allow directives need to come before the disallow directives for the same directory/file paths. (I have never personally tested this although it makes logical sense to instruct a robot to access one particular path within a directory structure before it sees that it is blocked from crawling that directory).
For example:-
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
As per how Google have formatted their robots.txt.
-
Thanks. I want to make sure I get this right in a syntax universally understood by all engines. I have seen webmasters all over the place on this one with some saying that crawlers use a first matching rule and others that say that crawlers use a last matching rule. I am almost thinking to have the allow command twice - before and after, to cover all bases.
-
I don't think it matters, but I think I would disallow first, because by default everything is an Allow.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Crawl solutions for landing pages that don't contain a robots.txt file?
My site (www.nomader.com) is currently built on Instapage, which does not offer the ability to add a robots.txt file. I plan to migrate to a Shopify site in the coming months, but for now the Instapage site is my primary website. In the interim, would you suggest that I manually request a Google crawl through the search console tool? If so, how often? Any other suggestions for countering this Meta Noindex issue?
Technical SEO | | Nomader1 -
How to allow bots to crawl all but WP-content
Hello, I would like my website to remain crawlable to bots, but to block my wp content and media. Does the following robots.txt work? I worry that the * user agent may conflict with the others. User-agent: *
Technical SEO | | Tom3_15
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/ User-agent: GoogleBot
Allow: / User-agent: GoogleBot-Mobile
Allow: / User-agent: GoogleBot-Image
Allow: / User-agent: Bingbot
Allow: / User-agent: Slurp
Allow: /0 -
Should I Focus on Video Schema or a Video Sitemap First
Hey all, I'm working on a website that is soon going to launch a video hub that contains over 500 videos. I'm interested in ensuring that the videos show up on the SERPs page in the highest position possible. I know Google recommends that you have on-page schema for your videos as well as an XML sitemap so they can be indexed for SERP. When I look at schema and the XML video sitemap they seem to communicate very similar kinds of information (Title, Description, Thumbnail, Duration). I'm not sure which one to start with; is it more important to have video schema or a video sitemap? Additionally, if anyone knows of any good video sitemap generators (free is best, but cheap is okay too) then please let me know. Cursory google searching has just churned up a number of tools that look sketchy.
Technical SEO | | perfectsearch710 -
Log in, sign up, user registration and robots
Hi all, We have an accommodation site that asks users only to register when they want to book a room, in the last step. Though this is the ideal situation when you have tons of users, nowadays we are having around 1500 - 2000 per day and making tests we found out that if we ask for a registration (simple, 1 click FB) we mail them all and through a good customer service we are increasing our sales. That is why, we would like to ask users to register right after the home page ie Home/accommodation or and all the rest. I am not sure how can I make to make that content still visible to robots.
Technical SEO | | Eurasmus.com
Will the authentication process block google crawling it? Maybe something we can do? We are not completely sure how to proceed so any tip would be appreciated. Thank you all for answering.3 -
Google indexing despite robots.txt block
Hi This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt Any clues why this is or what I could do to resolve it? Thanks!
Technical SEO | | zeepartner0 -
Adding multi-language sitemaps to robots.txt
I am working on a revamped multi-language site that has moved to Magento. Each language runs off the core coding so there are no sub-directories per language. The developer has created sitemaps which have been uploaded to their respective GWT accounts. They have placed the sitemaps in new directories such as: /sitemap/uk/sitemap.xml /sitemap/de/sitemap.xml I want to add the sitemaps to the robots.txt but can't figure out how to do it. Also should they have placed the sitemaps in a single location with the file identifying each language: /sitemap/uk-sitemap.xml /sitemap/de-sitemap.xml What is the cleanest way of handling these sitemaps and can/should I get them on robots.txt?
Technical SEO | | MickEdwards0 -
No indexing url including query string with Robots txt
Dear all, how can I block url/pages with query strings like page.html?dir=asc&order=name with robots txt? Thanks!
Technical SEO | | HMK-NL0 -
Robots.txt Sitemap with Relative Path
Hi Everyone, In robots.txt, can the sitemap be indicated with a relative path? I'm trying to roll out a robots file to ~200 websites, and they all have the same relative path for a sitemap but each is hosted on its own domain. Basically I'm trying to avoid needing to create 200 different robots.txt files just to change the domain. If I do need to do that, though, is there an easier way than just trudging through it?
Technical SEO | | MRCSearch0