Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Staging & development areas should not be indexable (i.e. noindex/nofollow in meta robots, etc.)
-
Hi
I take it that if there's a staging or development area on a subdomain of a site, whose content is therefore usually duplicate, then it should not be indexable (i.e. noindexed and nofollowed in meta robots)? This is to prevent duplicate content problems, as well as to stop people unrelated to the project seeing work in progress or finding it accidentally in search engine listings.
Also, if there's no such info in meta robots, is there any other way it may have been made non-indexable, or at least had the duplicate content problem removed, e.g. by canonicalising each page to the equivalent page on the live site?
In the case in question I am finding it listed in the SERPs when I search for the staging/dev area URL, so I presume this needs urgent attention?
Cheers
Dan
-
Use robots.txt rather than the meta tags; robots.txt is preferred.
-
I'm about to issue these instructions and would appreciate it if you could quickly confirm that they cover your advice correctly and that nothing is missing:
1) Set up a completely different GWT account, unrelated to the main site, so that there is a new GWT account specific to the staging subdomain.
2) Add a robots.txt on the staging subdomain that disallows all pages for all crawlers, OR use the noindex meta tag on all pages. It's obviously very important that when you update the main site it DOES NOT include or push out these files too (since that would result in the main site or its pages being de-indexed).
3) Request removal of all pages in GWT. Leave the form blank for the page to be removed, since this will remove the entire site.
4) After about 1 month (or once you see that the pages are all out of the SERPs), and Google has spidered the site and seen the robots.txt, put up a password on the entire staging site.
Note: For brand new staging areas that don't yet exist, or that exist but are new and not yet showing up in the index, simply add a password for human access so that the above process isn't required in the future.
-
Thanks for clarifying that CleverPHD & thanks again for all your help and great advice
Have a great weekend !!
All Best
Dan
-
That is a completely valid question. This is why you set up a separate GWT account for dev.domain.ext rather than using the www.domain.ext one. When you submit the removal request, it will only apply within the dev.domain.ext account.
The only thing you want to watch is that if you set up a robots.txt in your dev environment, you make sure it does not get pushed out to your production server. That is the only gotcha as I see it.
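One way to guard against that gotcha is a small pre-deploy check. The sketch below uses Python's standard urllib.robotparser to test whether a robots.txt body blocks every crawler from the site root. The function name and the two sample robots.txt strings are illustrative assumptions, not something from this thread.

```python
from urllib.robotparser import RobotFileParser


def blocks_all_crawlers(robots_txt: str) -> bool:
    """Return True if this robots.txt stops a generic crawler fetching "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "*" stands for any crawler that has no more specific rule group.
    return not parser.can_fetch("*", "/")


# Hypothetical file contents: what staging should serve...
STAGING_ROBOTS = "User-agent: *\nDisallow: /\n"
# ...and what production would serve if nothing is blocked.
PRODUCTION_ROBOTS = "User-agent: *\nDisallow:\n"

if __name__ == "__main__":
    # A deploy script could refuse to publish when this is True for production.
    print("staging blocks all:", blocks_all_crawlers(STAGING_ROBOTS))
    print("production blocks all:", blocks_all_crawlers(PRODUCTION_ROBOTS))
```

Running the same check against the file about to go live on www would flag a staging robots.txt that has slipped into the production build.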
-
thanks !
As per my last question, there's no risk of accidentally taking out the main site as part of this process?
cheers
dan
-
Thanks so much for that great advice
I'm just a bit worried about accidentally getting the main site removed. I take it that so long as it's a brand new GWT account for that specific subdomain, this can't happen?
Cheers
Dan
-
Here is Google's documentation on how to use GWT to remove a page/directory/site, and how that interacts with robots.txt:
http://googlewebmastercentral.blogspot.com/2010/03/url-removal-explained-part-i-urls.html
"In order for a directory or site-wide removal to be successful, the directory or site must be disallowed in the site's robots.txt file."
Side story. I once had a subdomain that I needed to take out, but I could not modify the robots.txt file properly (long story). So, we used the GWT tool and the meta noindex tag. It still worked, but I think that would only be a backup approach to the one suggested by the documentation.
-
Usually it would be true that you need to use the noindex tag to get things out of the SERPs and need to leave the robots.txt "open" to the crawlers. But when you are working with the remove URL tool in GWT, they recommend that you then block the site in robots.txt to keep the crawlers out of it.
The removal tool in GWT takes care of Google taking the URLs out and then the robots.txt keeps the bots from coming back. Just a different sequence than if you were to use the noindex meta.
-
If you create the GWT account for the dev site and you submit for removal, GWT requires that you either a) have the site blocked in robots.txt or b) have a noindex meta tag on the pages. Otherwise they will just crawl you again later and you are back in the index. See my post from earlier.
-
Short answer: no dev site should be public to anyone to start with (let alone Google et al.). The simplest way is to put an .htaccess password on all your dev sites. You can do a password per person in your company, or just one general one that everyone on the dev team shares.
If you do have a dev site in the SERPs, the simplest way to get it out is to set up a GWT account for that subdomain (e.g. dev.yourdomain.ext), then go into that account and request removal of all pages. You just leave the form blank for the page to be removed and it takes out the whole site. You then need a robots.txt on dev.yourdomain.ext (different from the www. version) that disallows all pages for all crawlers, or else use the noindex meta tag on all pages.
After about 1 month (or once you see that the pages are all out of the SERPs), I would put up a password on that entire site and be done with it. Key point: don't put the password up until you have let Google try to spider the site and see the robots.txt etc.
Also, if you have any other staging sites out there, like test.yourdomain.ext, and they are not indexed, go ahead and put the password up on them to limit your exposure.
Public dev sites are the fastest way to get duplicate content into the index and to jack with the ranking of your current site. It is key that all of them are locked down. If one of your developers says it is no big deal, call BS. It is a big deal and it can cause a big mess.
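The answer above uses an .htaccess password. For a staging site that happens to be served by a Python WSGI app instead, the same lock-down could be sketched as HTTP Basic Auth middleware. This is a swapped-in alternative, not the .htaccess method itself, and `require_password`, the realm name and the credentials are all hypothetical.

```python
import base64
import hmac


def require_password(app, username, password):
    """Wrap a WSGI app so every request needs one shared Basic Auth login."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    expected = ("Basic " + token).encode()

    def guarded(environ, start_response):
        supplied = environ.get("HTTP_AUTHORIZATION", "").encode()
        # Constant-time comparison of the whole Authorization header value.
        if hmac.compare_digest(supplied, expected):
            return app(environ, start_response)
        # No or wrong credentials: humans get a login prompt, crawlers get a 401.
        start_response(
            "401 Unauthorized",
            [
                ("WWW-Authenticate", 'Basic realm="staging"'),
                ("Content-Type", "text/plain"),
            ],
        )
        return [b"Password required\n"]

    return guarded
```

Usage would be something like `application = require_password(application, "dev", "change-me")` in the staging configuration only, with the real password kept out of source control.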
-
Hey Dan,
In this case, I would not exclude crawling via robots.txt. Perhaps later after you have verified the URLs are out of the index.
Just because Google can't crawl a page, it doesn't mean they won't keep it in the index. Excluding crawling will not get a page out of the index.
Add the NOINDEX, FOLLOW tag you listed above and give it some time.
Use GWT if it's urgent or the information is sensitive.
-
Thanks Anthony,
The staging area already exists and is indexable as far as I can tell.
So I need to tell the developers to exclude crawling via robots.txt, and add a noindex tag to the head of each page but keep it followed so it is still crawlable, i.e. a noindex, follow meta robots tag within the head section of every page on the dev area.
OR alternatively just remove the URLs via GWT?
If we are excluding crawling via the robots.txt file, then why do you need to add a noindex tag to each page too? Surely the robots.txt deals with this situation?
cheers
dan
-
Ideally when creating a new staging area, you'd want to exclude crawling via robots.txt.
Add the noindex tag to the head of your pages to get them removed from the SERPs. Make sure the pages are still crawlable though, because if you exclude them in robots.txt first and then noindex them, Google won't be able to see the new noindex tag.
If there are not a lot of pages to remove, you can request page removal within Google Webmaster Tools.
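The ordering issue described above (a noindex tag only works if the crawler is still allowed to fetch the page) can be checked with a short script. This is a minimal sketch using only Python's standard library; `removal_status`, `RobotsMetaFinder` and the returned status strings are made-up names for illustration, and it only looks at the generic "*" crawler group and the meta robots tag.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def removal_status(robots_txt: str, path: str, html: str) -> str:
    """Say whether a crawler could actually see a page's noindex tag."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)

    if "noindex" not in finder.directives:
        return "no noindex tag"
    if rules.can_fetch("*", path):
        return "noindex will be seen"
    # The tag exists, but robots.txt stops the crawler from ever reading it.
    return "noindex hidden by robots.txt"
```

Feeding it the staging robots.txt together with a sample page from the staging site shows which of the two situations in this thread you are in before you decide whether to block crawling.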