How do you block development servers with robots.txt?
-
When we create client websites, the URLs are client.oursite.com. Google is indexing these sites and attaching them to our domain. How can we stop it with robots.txt? I've heard you need to have the robots file on both the main site and the dev sites... A code sample would be groovy. Thanks, TR
-
Added an X-Robots-Tag header on our development sites.
Just a note - if you use Apache and have mod_pagespeed installed, it will conflict and PageSpeed will strip the X-Robots-Tag header. Here is what we use on the dev sites:
# Begin Bad Bot Blocking
BrowserMatchNoCase Googlebot bad_bot
BrowserMatchNoCase bingbot bad_bot
BrowserMatchNoCase OmniExplorer_Bot/6.11.1 bad_bot
BrowserMatchNoCase omniexplorer_bot bad_bot
BrowserMatchNoCase Baiduspider bad_bot
BrowserMatchNoCase Baiduspider/2.0 bad_bot
BrowserMatchNoCase yandex bad_bot
BrowserMatchNoCase yandeximages bad_bot
BrowserMatchNoCase Spinn3r bad_bot
BrowserMatchNoCase sogou bad_bot
BrowserMatchNoCase Sogouwebspider/3.0 bad_bot
BrowserMatchNoCase Sogouwebspider/4.0 bad_bot
BrowserMatchNoCase sosospider+ bad_bot
BrowserMatchNoCase jikespider bad_bot
BrowserMatchNoCase ia_archiver bad_bot
BrowserMatchNoCase PaperLiBot bad_bot
BrowserMatchNoCase ahrefsbot bad_bot
BrowserMatchNoCase ahrefsbot/1.0 bad_bot
BrowserMatchNoCase SiteBot/0.1 bad_bot
BrowserMatchNoCase DNS-Digger/1.0 bad_bot
BrowserMatchNoCase DNS-Digger-Explorer/1.0 bad_bot
BrowserMatchNoCase boardreader bad_bot
BrowserMatchNoCase radian6 bad_bot
BrowserMatchNoCase R6_FeedFetcher bad_bot
BrowserMatchNoCase R6_CommentReader bad_bot
BrowserMatchNoCase ScoutJet bad_bot
BrowserMatchNoCase ezooms bad_bot
BrowserMatchNoCase CC-rget/5.818 bad_bot
BrowserMatchNoCase libwww-perl/5.813 bad_bot
BrowserMatchNoCase "magpie-crawler 1.1" bad_bot
BrowserMatchNoCase jakarta bad_bot
BrowserMatchNoCase discobot/1.0 bad_bot
BrowserMatchNoCase MJ12bot bad_bot
BrowserMatchNoCase MJ12bot/v1.2.0 bad_bot
BrowserMatchNoCase MJ12bot/v1.2.5 bad_bot
BrowserMatchNoCase SemrushBot/0.9 bad_bot
BrowserMatchNoCase MLBot bad_bot
BrowserMatchNoCase butterfly bad_bot
BrowserMatchNoCase SeznamBot/3.0 bad_bot
BrowserMatchNoCase HuaweiSymantecSpider bad_bot
BrowserMatchNoCase Exabot/2.0 bad_bot
BrowserMatchNoCase netseer/0.1 bad_bot
BrowserMatchNoCase "NetSeer crawler/2.0" bad_bot
BrowserMatchNoCase NetSeer/Nutch-0.9 bad_bot
BrowserMatchNoCase psbot/0.1 bad_bot
BrowserMatchNoCase Moreoverbot/x.00 bad_bot
BrowserMatchNoCase moreoverbot/5.0 bad_bot
BrowserMatchNoCase "Jakarta Commons-HttpClient/3.0" bad_bot
BrowserMatchNoCase SocialSpider-Finder/0.2 bad_bot
BrowserMatchNoCase MaxPointCrawler/Nutch-1.1 bad_bot
BrowserMatchNoCase willow bad_bot
Order Deny,Allow
Deny from env=bad_bot
# End Bad Bot Blocking
Header set X-Robots-Tag "noindex, nofollow"
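If removing mod_pagespeed from the development server isn't an option, a minimal workaround (a sketch, assuming the standard mod_pagespeed Apache module) is to switch PageSpeed off for the dev vhost so it can't rewrite responses and strip the header:

<IfModule pagespeed_module>
    # Turn PageSpeed off entirely on the development vhost so it cannot
    # strip the X-Robots-Tag header set above.
    ModPagespeed off
</IfModule>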
-
On the root of the development subdomain, use the following robots.txt content to block all robots.
User-agent: *
Disallow: /

Next, verify the subdomain in Google Webmaster Tools as its own site, and request that the site be removed from the index.
For added protection:
- Make the robots.txt on the live site read-only, so when you copy the dev site over you don't accidentally overwrite it with the dev robots.txt that excludes everything (see the one-liner after this list)
- Set up a code monitor on the robots.txt for both the dev site and the live site that checks the content of those files and alerts you if there are changes. I use https://polepositionweb.com/roi/codemonitor/index.php.
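A quick way to handle the read-only step on a Linux host (just a sketch; the path is an example):

# Remove write permission so a routine deploy can't silently overwrite the file.
chmod 444 /var/www/livesite/robots.txt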
-
Like Daniel said, you can use robots.txt to block spiders, but that won't guarantee that URLs stay out of the search results. You could use an X-Robots-Tag in the server headers, or generate a 403 every time a crawler's user-agent hits the subdomain.
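Here's a rough .htaccess sketch of that idea for the dev subdomain, using the same Apache 2.2-style Order/Deny syntax as the block above (the user-agent list is only an example; mod_headers and mod_setenvif need to be enabled):

# Tell engines not to index anything they do manage to fetch.
Header set X-Robots-Tag "noindex, nofollow"
# Flag common crawler user-agents and answer them with 403 Forbidden.
BrowserMatchNoCase (googlebot|bingbot|yandex|baiduspider) is_crawler
Order Allow,Deny
Allow from all
Deny from env=is_crawler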
-
I put an .htaccess-style password on the development site. If you do make a robots.txt that blocks the site, make sure you don't accidentally put it on the production site.
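For reference, a minimal Basic Auth sketch for the dev site's .htaccess (the path and username are examples; the password file would be created with something like htpasswd -c /path/to/.htpasswd devuser):

# Require a login for everything on the development subdomain.
AuthType Basic
AuthName "Development site"
AuthUserFile /path/to/.htpasswd
Require valid-user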
-
Unfortunately I don't have that option.
-
Just use a directory instead of a subdomain and then block that directory in robots.txt... that's the easiest way.
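For example, if the dev copies lived under /dev/ (just an illustration), the main site's robots.txt could read:

User-agent: *
Disallow: /dev/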