Robots.txt Blocking - Best Practices
-
Hi All,
We have a web provider who isn't willing to remove the wildcard rule blocking all agents from crawling our client's site (User-agent: * / Disallow: /). They have other lines allowing certain bots to crawl the site, but we're wondering if the client is missing out on organic traffic because of this blanket blocking rule. It's also a pain because we're unable to set up Moz Pro, potentially because of this first rule.
We've researched this and haven't found much in the way of best practices for blocking all bots and then allowing certain ones. What do you think is the best practice for these files?
Thanks!
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Crawl-delay: 5

User-agent: Yahoo-slurp
Disallow:

User-agent: bingbot
Disallow:

User-agent: rogerbot
Disallow:

User-agent: *
Crawl-delay: 5
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp
-
Thanks for taking the time to respond in depth, GreenStone. We appreciate the advice and have passed your response along to the web hosting company, along with a frustrated email explaining that they're not adhering to anyone's best practices. Hopefully this will convince them!
-
Thanks, Dmitrii, for your response! From our research we've seen similar recommendations, and it helps to have more evidence to back them up. Hopefully these guys will give in a bit!
-
Completely agree. I really wouldn't want to host my stuff with a company that can't figure out what the best practices actually are ;-). This lays out very well why you shouldn't want your robots.txt set up the way it is right now.
-
In general, I definitely wouldn't recommend the way the web provider is handling this.
- Disallowing everything and then carving out exceptions should never be the norm. Generally, best practice is the opposite: allow everyone to crawl, and add exceptions only for the specific crawlers or paths you need to restrict.
- It makes a lot more sense to give crawlers full access, add crawl delays for the non-Google crawlers, and disallow just those specific paths (see the sketch after this list):
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp
- The Crawl-delay: 5 in the Googlebot group does you no good, as Google does not obey that directive; Googlebot's crawl rate can only be adjusted in Search Console.
- You can test what is visible to Googlebot with the robots.txt Tester in Search Console to verify exactly what it can access.
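To make that concrete, here's a rough sketch of what a cleaned-up file along those lines might look like. This is just an illustration, not a drop-in replacement: the Disallow paths are carried over from the current file, and the crawl-delay value of 5 is an assumption to adjust as needed.

User-agent: *
Crawl-delay: 5
Disallow: /new_vehicle_detail.asp
Disallow: /new_vehicle_compare.asp
Disallow: /news_article.asp
Disallow: /new_model_detail_print.asp
Disallow: /used_bikes/
Disallow: /default.asp?page=xCompareModels
Disallow: /fiche_section_detail.asp

With a single wildcard group like this, every crawler gets full access except the listed paths; the crawl delay is honored by bots that respect it (Bing, for example) and simply ignored by Google, so there's no need for the block-everything-then-whitelist structure at all.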
-
Here is another video from Matt - https://www.youtube.com/watch?v=I2giR-WKUfY
Lots of good points there too.
-
Hi.
Super weird setup - that's for sure.
User-agent: *
Disallow: /
Any bot that doesn't have its own user-agent group further down gets blocked off entirely. How in the world are they ranking with a file like that?
Watch that video - there are good ideas in it about controlling bots and crawlers, and you can treat those as best practices. And yes, what they have now is ridiculous.
https://moz.com/community/q/should-we-use-google-s-crawl-delay-setting
Here is a Q&A about crawl delays. As far as I know, Google ignores the crawl-delay directive anyway, and there isn't much benefit to it in any case.
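If you want to double-check how a parser actually reads the file they have now, Python's built-in urllib.robotparser gives a quick local approximation (real crawlers can differ on edge cases, so treat it as a sanity check rather than the final word). The snippet below is a sketch: it pastes in a trimmed-down copy of the groups from the file above, and the example.com URL and the "SomeOtherBot" name are just placeholders.

from urllib import robotparser

# A trimmed-down copy of the relevant groups from the site's current robots.txt.
ROBOTS_TXT = """
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Crawl-delay: 5

User-agent: bingbot
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Bots that have their own group with an empty Disallow are allowed to crawl everything...
for bot in ("Googlebot", "bingbot"):
    print(bot, "can fetch /:", rp.can_fetch(bot, "https://www.example.com/"))

# ...while any bot that only matches the wildcard group is shut out entirely.
print("SomeOtherBot can fetch /:", rp.can_fetch("SomeOtherBot", "https://www.example.com/"))

# The parser reports the crawl-delay (Python 3.6+), but Google ignores this directive in practice.
print("Googlebot crawl-delay:", rp.crawl_delay("Googlebot"))

The named bots come back as allowed because their own groups override the wildcard, while anything without an explicit group is blocked by the first rule - which is exactly why the blanket Disallow: / only hurts the crawlers that weren't given their own group.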
Hope this helps.
Related Questions
-
What Are Internal Linking Best Practices For Blogs?
We have a blog for our e-commerce site. We are posting about 4-5 blog posts a month, most of them 1500+ words. Within the content, we have around 10-20 links pointing out to other blog posts or products/categories on our site. Except for the products/categories, the links use non-optimized generic anchor text (e.g. guide, sizing tips, planning resource). Are there any issues or problems as far as SEO with this practice? Thank you
Intermediate & Advanced SEO | kekepeche
-
Google robots.txt test - not picking up syntax errors?
I just ran a robots.txt file through "Google robots.txt Tester" as there was some unusual syntax in the file that didn't make any sense to me... e.g.
/url/?*
/url/?
/url/*
and so on. I would use ? and not ? for example and what is ? for! - etc. Yet "Google robots.txt Tester" did not highlight the issues... I then fed the sitemap through http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php and that tool actually picked up my concerns. Can anybody explain why Google didn't - or perhaps it isn't supposed to pick up such errors? Thanks, Luke
Intermediate & Advanced SEO | McTaggart
-
Search engine blocked by robots-crawl error by moz & GWT
Hello everyone. For my site I am getting Error Code 605: Page Banned by robots.txt, X-Robots-Tag HTTP Header, or Meta Robots Tag. Google Webmaster Tools is also not able to fetch my site. tajsigma.com is my site. Can any expert help, please? Thanks
Intermediate & Advanced SEO | falguniinnovative
-
Our Robots.txt and Reconsideration Request Journey and Success
We have asked a few questions related to this process on Moz and wanted to give a breakdown of our journey as it will likely be helpful to others!
A couple of months ago, we updated our robots.txt file with several pages that we did not want to be indexed. At the time, we weren't checking WMT as regularly as we should have been, and in a few weeks we found that apparently one of the robots.txt entries we added was blocking a dynamic file, which led to the blocking of over 950,000 of our pages according to Webmaster Tools. Which page was causing this is still a mystery, but we quickly removed all of the entries.
From research, most people say that things normalize in a few weeks, so we waited. A few weeks passed and things did not normalize. We searched, we asked, and the number of "blocked" pages in WMT, which had increased at a rate of a few hundred thousand a week, was decreasing at a rate of a thousand a week. At this rate it would be a year or more before the pages were unblocked. This did not change. Two months later we were still at 840,000 pages blocked. We posted on the Google Webmaster Forum and one of the mods there said that it would just take a long time to normalize. Very frustrating indeed, considering how quickly the pages had been blocked.
We found a few places on the interwebs suggesting that if you have an issue/mistake with robots.txt you can submit a reconsideration request. This seemed to be our only hope. So we put together a detailed reconsideration request asking for help with our blocked pages issue. A few days later, to our horror, we did not get a message offering help with our robots.txt problem. Instead, we received a message saying that we had received a penalty for inbound links that violate Google's terms of use. Major backfire. We used an SEO company years ago that posted a hundred or so blog posts for us. To our knowledge, the links didn't even exist anymore. They did...
So we signed up for an account with removeem.com. We quickly found many of the links posted by the SEO firm, as they were easily recognizable via the anchor text. We began the process of using removeem to contact the owners of the blogs. To our surprise, we got a number of removals right away! Others we had to contact another time, and many did not respond at all. Those we could not find an email for, we tried posting comments on the blog. Once we felt we had removed as many as possible, we added the rest to a disavow list and uploaded it using the disavow tool in WMT. Then we waited...
A few days later, we already had a response. DENIED. In our request, we had specifically asked that if it were to be denied, Google provide some example links. When they denied our request, they sent us an email including a sample link. It was an interesting example: we actually already had this blog in removeem. The issue in this case was that our version was a domain name, i.e. www.domainname.com, and the version Google had was a WordPress subdomain, i.e. www.subdomain.wordpress.com.
So we went back to the drawing board. This time we signed up for Majestic SEO and tied it in with removeem. That added a few more links. We also had records from the old SEO company we were able to go through to locate a number of new links. We repeated the previous process, contacting site owners and keeping track of our progress. We also went through the "sample links" in WMT as best as we could (we have a lot of them) to try to pinpoint any other potentials. We removed what we could and, again, disavowed the rest.
A few days later, we had a message in WMT. DENIED AGAIN! This time it was very discouraging, as it just didn't seem there were any more links to remove. The difference this time was that there was NOT an email from Google, only a message in WMT. So, while we didn't know if we would receive a response, we responded to the original email asking for more example links so we could better understand what the issue was. Several days passed and we received an email back saying that THE PENALTY HAD BEEN LIFTED! This was of course very good news, and it appeared that our email to Google was reviewed and received well.
So the final hurdle was the reason we originally contacted Google: our robots.txt issue. We did not receive any information from Google related to the robots.txt issue we originally filed the reconsideration request for. We didn't know if it had just been ignored, or if there was something that might be done about it. So, as a last-ditch final effort, we responded to the email once again and requested help with the robots.txt issue as we had the other times. The weekend passed, and on Monday we checked WMT again. The number of blocked pages had dropped over the weekend from 840,000 to 440,000! Success! We are still waiting and hoping that number will continue downward back to zero.
So, some thoughts:
1. Was our site manually penalized from the beginning, yet without a message in WMT? Or, when we filed the reconsideration request, did the reviewer take a closer look at our site, see the old paid links and add the penalty at that time? If the latter is the case then...
2. Did our reconsideration request backfire? Or was it ultimately for the best?
3. When asking for reconsideration, make your requests known. If you want example links, ask for them. It never hurts to ask! If you want to be connected with Google via email, ask to be!
4. If you receive an email from Google, don't be afraid to respond to it. I wouldn't overdo this or spam them. Keep it to the bare minimum and don't pester them, but if you have something pertinent to say that you have not already said, then don't be afraid to ask.
Hopefully our journey might help others who have similar issues, and feel free to ask any further questions. Thanks for reading! TheCraig
Intermediate & Advanced SEO | TheCraig
-
What is the best way to incorporate region-based keywords?
Greetings Mozzers, I am wanting to get the most "bang for my buck" in regards to region-based keyword pages. If I am going after the keyword "Plumber" and the region "San Antonio", would it be best to:
1- Create a San Antonio Plumber page where we can target all critical factors for the region-based keyword "San Antonio Plumber"
2- Link every instance of the term "San Antonio" and "San Antonio Plumber" throughout the site to the newly created "San Antonio Plumber" page.
Thank you for any advice/clarification on this matter.
Intermediate & Advanced SEO | MonsterWeb28
-
Best url structure
I am making a new site for a company that services many cities. I was thinking of a URL structure like this: website.com/keyword1-keyword2-keyword3/cityname1-cityname2-cityname3-cityname4-cityname5. Will this be the best approach to optimize the site for the keyword plus 5 different cities, as long as I keep the total URL characters under the SEOmoz-recommended 115 characters? Or would it be better to build separate pages for each city, trying to reword the main services to try to avoid duplicate content?
Intermediate & Advanced SEO | jlane9
-
URL blocked
Hi there, I have recently noticed that we have a link from an authoritative website; however, when I looked at the code, it looked like this: <a href="http://www.mydomain.com/" title="blocked::http://www.mydomain.com/">keyword</a> You will notice that in the code there is 'blocked::'. What is this? Does it have the same effect as a nofollow tag? Thanks for any help
Intermediate & Advanced SEO | Paul78
-
Robots.txt disallow subdomain
Hi all, I have a development subdomain, which gets copied to the live domain. Because I don't want this dev domain to get crawled, I'd like to implement a robots.txt for this domain only. The problem is that I don't want this robots.txt to disallow the live domain. Is there a way to create a robots.txt for this development subdomain only? Thanks in advance!
Intermediate & Advanced SEO | Partouter