Robots.txt Syntax
-
Does the order of the robots.txt syntax matter in SEO?
For example (are there potential problems with this format):
User-agent: *
Sitemap:
Disallow: /form.htm
Allow: /

Disallow: /cgnet_directory
-
Rodrigo -
Thanks, and thanks for the follow-up. To be honest with you, though, I have not seen or experienced anything on this one; I tend to just follow the suggested rules when writing the file.
So my answer is "I don't know." Anyone else know?
I also agree with you on the meta tags. Robots.txt is best used for disallowing folders and such, not pages. For instance, I might do a "Disallow: /admin" in the robots.txt file, but would never block a category page or something to that effect. If I wanted a page like that out of the index, I'd use the meta "noindex,follow" attribute instead. Good point!
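As a sketch, that folder-level block would look like this on its own (the /admin path is just the example from above; note that Disallow matches by URL prefix, so this covers /admin and everything beneath it):
User-agent: *
Disallow: /admin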
-
Thanks, John - good response. I think the biggest takeaway for me is that none of the "dis-order" above will actually cause errors in the file. That said, I completely agree with your recommendations on where the Sitemap: line should go, and on why the Allow: parameter is unnecessary.
Last question: do you know whether the blank line between the Allow: line and the second Disallow: line causes any issues?
Side note for those using robots.txt to block content: also consider the "noindex,follow" attribute in the robots META tag as an alternative, to preserve some of the link value those pages may be getting.
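For reference, a minimal sketch of that tag, placed in the <head> of whichever page you want kept out of the index (which pages you apply it to is up to you):
<meta name="robots" content="noindex,follow">
Unlike a robots.txt block, crawlers can still fetch the page and follow its links; the page itself is just dropped from the index.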
-
Rodrigo -
Good question. The syntax does in fact matter, though not necessarily for SEO rankings. It matters because if you screw up your robots.txt, you can inadvertently disallow your whole site (I did it last week. Not pretty. Blog post forthcoming).
To get to your question: it is usually best to put the "Sitemap:" line at the bottom of the robots.txt file, but so far as I know it is not required to be there.
You do not need the Allow: / parameter, because if you leave it out, Google assumes you want everything crawled except whatever appears in the "Disallow:" lines.
In your case, you are disallowing "http://www.site.com/form.htm" and everything in your cgnet_directory folder. If that page and that folder are what you want hidden from crawlers, you have done exactly what you need to do.
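Putting those recommendations together, a tightened-up version of the file might look like this (the sitemap URL is a placeholder on the same example site.com domain):
User-agent: *
Disallow: /form.htm
Disallow: /cgnet_directory
Sitemap: http://www.site.com/sitemap.xml
Same rules, just with Allow: / dropped and the Sitemap: line moved to the bottom.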
I'm still learning about this, so I'm open to any correction the rest of the community has.
Related Questions
-
Robots.txt Tester - syntax not understood
I've looked in the robots.txt Tester and I can see 3 warnings: there is a 'syntax not understood' warning for each of these XML sitemaps:
https://www.pkeducation.co.uk/post-sitemap.xml
https://www.pkeducation.co.uk/sitemap_index.xml
How do I fix or reformat these to remove the warnings? Many thanks in advance. Jim
Technical SEO | JamesHancocks1
-
XHTML tag syntax for rel=alternate hreflang
Is there a difference between the two tags below? My dev team is saying the first can be implemented (technical issue on their end), even though the second is preferable according to support.google.com for sitemap hreflang notation. My question is, will the first xhtml tag work for Google? Appreciate the input.
<xhtml:link href="http://store.hp.com/CanadaStore/" hreflang="en-ca" rel="alternate" />
<xhtml:link href="http://store.hp.com/CanadaStore/" rel="alternate" hreflang="en-ca" />
Technical SEO | ZachKline
-
Robots.txt Disallow: / in Search Console
Two days ago I found out through Search Console that my website's robots.txt has changed to:
User-agent: *
Disallow: /
When I check the robots.txt on the website itself it looks fine - it shows as blocked only in Search Console (in the robots.txt Tester). When I try Fetch as Google on the homepage, I see it's blocked too. Any ideas why robots.txt would block my website? It was fine until the weekend. Before that, over the last 3 months, I had blocked resources on the website and brought pages back with Fetch as Google. Any ideas?
Technical SEO | RAN_SEO
-
Adding your sitemap to robots.txt
Hi everyone, Best practice question: When adding your sitemap to your robots.txt file, do you add the whole sitemap at once or do you add different subcategories (products, posts, categories,..) separately? I'm very curious to hear your thoughts!
Technical SEO | WeAreDigital_BE
-
Is there a limit to how many URLs you can put in a robots.txt file?
We have a site that has way too many URLs caused by our crawlable faceted navigation. We are trying to purge 90% of our URLs from the indexes. We put noindex tags on the URL combinations that we do not want indexed anymore, but it is taking Google way too long to find the noindex tags. Meanwhile we are getting hit with excessive-URL warnings and have been hit by Panda. Would it help speed the process of purging URLs if we added the URLs to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the URLs, but not purge them from the index? The list could be in excess of 100MM URLs.
Technical SEO | kcb8178
-
Robots.txt anomaly
Hi, I'm monitoring a site that's had a new design relaunch and a new robots.txt added. Over the period of a week (since launch), Webmaster Tools has shown a steadily increasing number of blocked URLs (now at 14). In the robots.txt file, though, there are only 12 lines with the Disallow command. Could this be occurring because one Disallow line can refer to more than one page/URL? They all look like single URLs, for example:
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
etc., etc. Also, is it normal for Webmaster Tools' reporting of robots.txt-blocked URLs to steadily increase in number over time, as opposed to being identified straight away? Thanks in advance for any help/advice/clarity on why this may be happening. Cheers, Dan
Technical SEO | Dan-Lawrence
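One thing worth noting on the first question: Disallow rules match by URL prefix, so a single line such as
Disallow: /wp-content/plugins
covers /wp-content/plugins itself and every URL beneath it (a hypothetical /wp-content/plugins/some-plugin/readme.txt included), which is one way the count of blocked URLs can exceed the count of Disallow lines.
-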
Same URL in "Duplicate Content" and "Blocked by robots.txt"?
How can the same URL show up in the SEOmoz Crawl Diagnostics "Most common errors and warnings" in both the "Duplicate Content" list and the "Blocked by robots.txt" list? Shouldn't the latter exclude it from the former?
Technical SEO | alsvik
-
Getting home page content at top of what robots see
When I click on the text-only cache of nlpca(dot)com's home page, http://webcache.googleusercontent.com/search?q=cache:UIJER7OJFzYJ:www.nlpca.com/&hl=en&gl=us&strip=1, our H1 and body content are at the very bottom. How do we get the H1 and content to the top of what the robots see? Thanks!
Technical SEO | BobGW