Robots.txt Blocking - Best Practices
-
Hi All,
We have a web provider who isn't willing to remove the wildcard rule blocking all agents from crawling our client's site (User-agent: *, Disallow: /). They have other lines allowing certain bots to crawl the site, but we're wondering if the client is missing out on organic traffic because of this blanket blocking rule. It's also a pain because we're unable to set up Moz Pro, potentially because of that first rule.
We've researched and haven't found a ton of best practices regarding blocking all bots, then allowing certain ones. What do you think is a best practice for these files?
Thanks!
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Crawl-delay: 5

User-agent: Yahoo-slurp
Disallow:

User-agent: bingbot
Disallow:

User-agent: rogerbot
Disallow:

User-agent: *
Crawl-delay: 5
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp
-
Thanks for taking the time to respond in depth, GreenStone. We appreciate the advice and have passed your response along to the web hosting company (together with a frustrated email) explaining that they're not adhering to anyone's best practices. Hopefully this will convince them!
-
Thanks, Dmitrii, for your response! From our research we've seen similar recommendations, and it helps to have more evidence to back them up. Hopefully these guys will give in a bit!
-
Completely agree. I really wouldn't want to host my stuff with a company that can't figure out what the best practices really are ;-). This lays out very well why you shouldn't want your robots.txt set up the way it is right now.
-
In general, I definitely wouldn't recommend the way the web provider is handling this.
- Disallowing all bots and then carving out exceptions should not be the norm. Best practice is generally the reverse: allow everything to crawl, then add exceptions (crawl delays or specific disallows) for crawlers other than Google.
- It makes a lot more sense to give crawlers full access, add crawl delays for the non-Google crawlers, and disallow only the specific paths that shouldn't be crawled (/new_vehicle_detail.asp, /new_vehicle_compare.asp, /news_article.asp, /new_model_detail_print.asp, /used_bikes/, /default.asp?page=xCompareModels, /fiche_section_detail.asp); see the example file after this list.
- The Crawl-delay: 5 in the Googlebot group does nothing for you, as Google ignores the crawl-delay directive; Googlebot's crawl rate can only be adjusted through Search Console.
- You can verify what Googlebot can access with the robots.txt testing tool in Search Console.
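As a rough sketch (with the Crawl-delay value simply carried over from the current file), the whole thing could likely be collapsed into a single group; since Google ignores Crawl-delay, that line only affects the other crawlers that choose to honor it:

User-agent: *
Crawl-delay: 5
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp

Everything not listed stays crawlable for every bot, including rogerbot, so Moz Pro should be able to crawl the site as well.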
-
Here is another video from Matt - https://www.youtube.com/watch?v=I2giR-WKUfY
Lots of good points there too.
-
Hi.
Super weird client - that's for sure.
User-agent: *
Disallow: /

Every bot that doesn't have its own group in the file gets blocked off completely! The only reason the site still ranks is that Googlebot (and a few other bots) have their own groups letting them in.
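You can check this with Python's standard-library robots.txt parser. A minimal sketch, with example.com and the extra bot name as placeholders, and the file trimmed to the relevant groups:

from urllib.robotparser import RobotFileParser

# Trimmed version of the file in question: a catch-all block plus two bot-specific groups.
robots_txt = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ("Googlebot", "bingbot", "SomeRandomBot"):
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

# Expected output:
# Googlebot: allowed
# bingbot: allowed
# SomeRandomBot: blocked   <- only bots without their own group hit the catch-all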
Watch that video; it has good ideas about controlling bots and crawlers, and you can treat those as best practices. And yes, what they have now is ridiculous.
https://moz.com/community/q/should-we-use-google-s-crawl-delay-setting
Here is a Q&A about crawl delays. As far as I know, Google ignores the crawl-delay directive anyway, and there isn't much benefit to it in any case.
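For what it's worth, Crawl-delay is purely advisory: nothing happens unless the crawler chooses to read it and wait. A rough sketch of how a self-built crawler might honor it, with example.com and the bot name as placeholders:

import time
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")  # placeholder URL
parser.read()

# crawl_delay() returns the delay for the bot's own group, falling back to the * group, else None.
delay = parser.crawl_delay("mybot") or 0

for url in ("https://example.com/page-1", "https://example.com/page-2"):
    if parser.can_fetch("mybot", url):
        # ...fetch the page here...
        time.sleep(delay)  # Googlebot never does this; it ignores Crawl-delay entirely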
Hope this helps.