Robots.txt Blocking - Best Practices

ReunionMarketing

Hi All,

We have a web provider who's not willing to remove the wildcard line of code blocking all agents from crawling our client's site (user-agent: *, Disallow: /). They have other lines allowing certain bots to crawl the site but we're wondering if they're missing out on organic traffic by having this main blocking line. It's also a pain because we're unable to set up Moz Pro, potentially because of this first line.

We've researched and haven't found a ton of best practices regarding blocking all bots, then allowing certain ones. What do you think is a best practice for these files?

Thanks!

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Crawl-delay: 5

User-agent: Yahoo-slurp
Disallow: 

User-agent: bingbot
Disallow:

User-agent: rogerbot
Disallow:

User-agent: *
Crawl-delay: 5
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp
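
For anyone who wants to check how a file like the one above is read, here is a minimal sketch using Python's standard-library urllib.robotparser. The bot name ExampleBot and the example.com URL are placeholders, not anything from the client's site. Python's parser keeps only the first User-agent: * group and ignores the second, and other parsers may treat duplicate groups differently, so take this as an illustration rather than a statement of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post above, copied as-is.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Crawl-delay: 5

User-agent: Yahoo-slurp
Disallow:

User-agent: bingbot
Disallow:

User-agent: rogerbot
Disallow:

User-agent: *
Crawl-delay: 5
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp
"""


def check_access(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# "ExampleBot" is a made-up crawler that has no group of its own in the file.
result = check_access(
    ROBOTS_TXT,
    ["Googlebot", "bingbot", "rogerbot", "ExampleBot"],
    "https://example.com/",
)
for agent, allowed in result.items():
    print(agent, "allowed" if allowed else "blocked")
```

With this parser, the crawlers that have their own group are let through by their empty Disallow: line, while any crawler without one falls back to the first wildcard group and is blocked.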

ReunionMarketing

Thanks for taking the time to respond in depth, GreenStone. We appreciate the advice and have passed your response along to the web hosting company (along with a frustrated email) explaining they're not adhering to anyone's best practices. Hopefully this will convince them!

ReunionMarketing

Thanks, Dmitrii, for your response! From our research we've seen similar recommendations, and it helps to have more evidence to back them up. Hopefully these guys will give in a bit!

Martijn_Scheijbeler

Completely agree. I really wouldn't want to host my stuff with a company that can't figure out what the best practices really are ;-). This lays out very well why you shouldn't want your robots.txt set up the way it is right now.

GreenStone

In general, I definitely wouldn't recommend the way the web provider is handling this.

Disallowing everything and then adding exceptions should never be the norm. The general best practice is to allow all crawlers and add exceptions only where they are needed:
- It makes a lot more sense to give crawlers full access, add crawl delays for non-Google crawlers, and disallow only these specific pages and sub-folders:
  Disallow: /new_vehicle_detail.asp
  Disallow: /new_vehicle_compare.asp
  Disallow: /news_article.asp
  Disallow: /new_model_detail_print.asp
  Disallow: /used_bikes/
  Disallow: /default.asp?page=xCompareModels
  Disallow: /fiche_section_detail.asp
- The Crawl-delay: 5 line in the Googlebot group does not do you any good, as Google does not obey that directive. Only Search Console can control Googlebot's crawl rate.
- You can test what is visible to Googlebot in Search Console's "robots" subsection to verify what it can access.
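
As a rough sketch of the allow-by-default layout described above, the snippet below builds a single wildcard group containing only the specific Disallow lines from the original file plus a crawl delay, then checks it with Python's standard-library urllib.robotparser. The layout itself, the bot name ExampleBot, and the example.com URLs are assumptions for illustration; the web provider's real requirements may differ. The crawl delay sits in the wildcard group because, as noted above, Google ignores it anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-by-default robots.txt; the paths come from the original file.
SUGGESTED_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return the configured parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SUGGESTED_ROBOTS_TXT)

# "ExampleBot" stands in for any crawler, named in the file or not.
home_allowed = parser.can_fetch("ExampleBot", "https://example.com/")
used_bikes_allowed = parser.can_fetch("ExampleBot", "https://example.com/used_bikes/")
delay = parser.crawl_delay("ExampleBot")

print("home page allowed:", home_allowed)
print("/used_bikes/ allowed:", used_bikes_allowed)
print("crawl delay:", delay)
```

With this layout a tool such as rogerbot needs no special entry, because nothing blocks it by default, and only the listed pages stay off limits.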

DmitriiK

Here is another video from Matt - https://www.youtube.com/watch?v=I2giR-WKUfY

Lots of good points there too.

DmitriiK

Hi.

Super weird client - that's for sure.

User-agent: *
Disallow: /

Every bot will be blocked off! How in the world are they ranking?

https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday

Watch that video. It has good ideas on controlling bots and crawlers, and you can treat them as best practices. And yes, what they have now is ridiculous.

https://moz.com/community/q/should-we-use-google-s-crawl-delay-setting

Here is a Q&A about crawl delays. As far as I know, Google ignores crawl delays anyway, and there is nothing to gain from setting one.

Hope this helps.
