Development Website Duplicate Content Issue
-
Hi,
We launched a client's website around 7th January 2013 (http://rollerbannerscheap.co.uk). We originally built the website on a development domain (http://dev.rollerbannerscheap.co.uk), which was active for around 6-8 months before we migrated dev --> live. The dev site was open to search engines for the first 3-4 months, but then blocked again.
In late Jan 2013 I changed the robots.txt file on the live site to allow search engines to index it. A week later I accidentally logged into the DEV website and also changed its robots.txt file to allow the search engines to index it.
This obviously caused a duplicate content issue as both sites were identical. I realised what I had done a couple of days later and blocked the dev site from the search engines with the robots.txt file.
Most of the pages from the dev site had been de-indexed from Google apart from 3: the home page (dev.rollerbannerscheap.co.uk) and two blog pages. The live site has 184 pages indexed in Google. So I thought the last 3 dev pages would disappear after a few weeks.
I checked back late February and the 3 dev site pages were still indexed in Google. I decided to 301 redirect the dev site to the live site to tell Google to rank the live site and to ignore the dev site content. I also checked the robots.txt file on the dev site and this was blocking search engines too. But still the dev site is being found in Google wherever the live site should be found.
When I do find the dev site in Google it displays this:
Roller Banners Cheap » admin
dev.rollerbannerscheap.co.uk/
A description for this result is not available because of this site's robots.txt – learn more.
This is really affecting our client's SEO plan and we can't seem to remove the dev site or rank the live site in Google.
Please can anyone help?
-
Glad that helped, Lewis.
Unfortunately, there's really no way to determine how long the 301-redirect process will take to get the URLs out of the SERPs. That's entirely up to the search engines and I've never seen much consistency to how long this takes for different cases.
One other thing you could do to try to help speed the process is to add an XML sitemap to the dev site, and verify it in both Webmaster Tools. (Only do this AFTER you have added the meta robots no-index tag to the remaining pages' headers!) This will help remind the crawlers of the dev pages, and hopefully get the crawlers to visit them sooner, thereby noticing the redirects and individual no-indexes, and taking action on them sooner.
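As a rough illustration of the sitemap idea, here is a minimal Python sketch that builds a sitemap for the dev subdomain. The function name `build_sitemap` is made up for this example, and the blog URL is a hypothetical placeholder, since the thread only names the dev home page.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# The home page comes from the thread; the blog path is a hypothetical example.
dev_urls = [
    "http://dev.rollerbannerscheap.co.uk/",
    "http://dev.rollerbannerscheap.co.uk/blog/example-post/",
]
sitemap_xml = build_sitemap(dev_urls)
```

The resulting string would be saved as something like `sitemap.xml` on the dev site and submitted through the dev subdomain's Webmaster Tools profile.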
Personally, I'd let the process run for 2 or 3 weeks after the dev pages get re-indexed without the robots.txt. If the pages are gone, job done. If not, at that point I'd re-evaluate how much damage is being done by still having the dev site in the SERPs. If the damage is heavy, I'd be seriously tempted to use the URL Removal Tool in Bing & Google Webmaster Tools to get them out of the results so I could move on with building the authority of the primary domain (even though that would throw away the value the dev pages have built up).
REMEMBER! Once you've removed the robots.txt no-index, the meta titles and especially meta descriptions of the DEV site are what will, at least temporarily, be showing in the SERPs once the pages get re-indexed. So make certain they have been fully optimised as if they were the real site. That way, at least in the near term, you'll still be attracting good traffic while waiting for the pages to hopefully drop out. This may allow even the dev pages to do well enough at bringing traffic that you can afford to wait until they drop out naturally.
As far as seeing the additional 70 or so pages that are indexed, as Dan says, at the bottom of the search page is this paragraph and link:
_In order to show you the most relevant results, we have omitted some entries very similar to the 3 already displayed. If you like, you can repeat the search with the omitted results included._
When you click on that link, you'll see the additional pages. This is called the supplemental index and usually means these pages aren't showing up very well in the results anyway. Which means that for most of them, it will be sufficient to make sure you've added the meta robots no-index tag to their page headers to get them removed from the index and avoid future problems.
Does all that make sense?
Paul
-
Thanks for the confirmation, Dan!
As for the process of verifying the subdomain in order to remove it using Webmaster Tools - I covered that as the last point in option 2.
Paul
-
Hi Lewis
Be sure to register the dev subdomain as a separate website with Webmaster Tools. Then do the URL removal from the dev subdomain site profile. I've seen this method work in as little as a few days.
You can see the other pages in the index by selecting "repeat the search with the omitted results included".
-Dan
-
Wow thanks Paul, great and thorough answer!
The only thing I'll add - in terms of doing a URL removal for the subdomain;
-
You have to first verify the subdomain as a totally separate website in Webmaster Tools. WMT treats each subdomain, and even the HTTPS version of a site, as a different website, so register that.
-
THEN you can remove the entire subdomain, using the WMT subdomain profile.
-Dan
-
-
Hi Paul,
Firstly I want to thank you for the great effort you have put into answering my question.
I have changed the robots.txt file by going to Settings > Privacy > allow SERPs.
Do you know how long this may take to remove the dev site from the search engines?
Also, when I search site:dev.rollerbannerscheap.co.uk in Google I only see 3 pages indexed, so I'm unable to see the other 70?
Thanks
-
requires http i believe
-
I think the root of your problem comes from a common misconception about the robots.txt file, Lewis.
A robots.txt no-index directive (strictly speaking, a Disallow rule) is NOT designed to get pages removed from the search index. It simply tells the crawler: "when you encounter this directive, don't crawl any further". So the crawler never even gets a chance to discover whether there are any further pages, never mind whether they might be in the index already.
THEREFORE! Any pages that are already in the index will simply stay there. (And if any outside sources have links to internal pages behind a robots.txt no-index directive, those linked pages' URLs will often be added to the search index anyway!) Any pages which are in the index this way will have their meta descriptions blocked from displaying by the robots.txt directive, as you are seeing in your case.
Since a robots.txt no-index directive stops the crawler from looking any deeper, the engines are blocked from actually discovering the 301 redirects on your dev pages, and so aren't getting the cue to drop them in favour of the new pages! Hence the dev site stays in the index and shows up in SERPs. The human user does get the redirect so ends up on the new page, but you still have the duplicate content/competition problem.
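To see why the redirects never get noticed, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt contents are assumed (a blanket `Disallow: /`, which is the usual form of a full block), and the helper name `crawler_may_fetch` is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for the dev subdomain: block every crawler from every path.
BLOCKED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Assumed robots.txt after the block is lifted: an empty Disallow allows everything.
OPEN_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def crawler_may_fetch(robots_txt, url, user_agent="Googlebot"):
    """Return True if a crawler that obeys robots.txt may request the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


dev_home = "http://dev.rollerbannerscheap.co.uk/"
print(crawler_may_fetch(BLOCKED_ROBOTS_TXT, dev_home))  # False
print(crawler_may_fetch(OPEN_ROBOTS_TXT, dev_home))     # True
```

While the first result is `False`, a well-behaved crawler never requests the page at all, so it cannot see a 301 response or any tag in the page header.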
NOTE: to actually tell the search engines not only to not index the page, but to remove it if it is already indexed, you must add a meta-no-index tag in the header of individual pages. The robots.txt no-index MUST NOT be in place in order for this tag to be discovered and obeyed. There is a setting on the WordPress Settings -> Reading page to disallow crawling, which automatically adds the meta-no-index tag to each page's header.
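For reference, the tag in question normally looks like `<meta name="robots" content="noindex">` inside the page's `<head>`. Below is a minimal sketch, using only Python's standard `html.parser`, of how you might check whether a page's HTML carries it. The class and function names, and the sample page, are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list, e.g. "noindex, follow".
        for part in attr.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the HTML contains a meta robots tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical dev page head, for illustration only.
sample_page = (
    "<html><head><title>Roller Banners Cheap</title>"
    '<meta name="robots" content="noindex, follow">'
    "</head><body></body></html>"
)
print(has_noindex(sample_page))  # True
```

Running a check like this over the HTML of each remaining dev page is one way to confirm the tag is really in place before lifting the robots.txt block.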
Unfortunately, the problem is bigger than you stated, as I'm finding almost 70 pages from the dev site indexed in the .co.uk SERPs.
Here are what I see as your two main options, along with their ramifications:
1. Remove the robots.txt no-index directive and allow the 301 redirects to be crawled, eventually causing the dev pages to drop out of the SERPs
- this would be the preferred option if the existing dev site pages have actually started to acquire incoming links and ranking value, but you'd have no control over how long it would take for the competing dev pages to drop out of the index, meaning they will continue to interfere with your SEO until that process completes
- you'll need to check whether any of the other 70 pages in the results have incoming links and if so 301 redirect them as well
- you'll need to add meta-robots no-index tags to the header of each of the remaining non-redirected pages on the dev site to get them removed from the index.
2. Use the URL Removal Tool in Google and Bing Webmaster Tools to have the dev site removed from the index
- likely the fastest way to get the competing URLs out of the indexes, but would mean that any acquired link authority from the dev pages would be lost, not transferred to the live site.
- would still require either the robots.txt no-index directive to stay in place, or better yet, removing it and replacing it with meta-no-index tags in the header of every page on the dev site.
- you'd need to remove the 301 redirects
- since the search engines consider subdomains completely separate sites, you'd need to set up and verify the dev subdomain as a separate site in both Google and Bing webmaster tools in order for the URL Removal Tool to work.
I've never actually used the URL Removal tool on a full subdomain before, but see no reason why it wouldn't work as expected. You could actually test it out first on your dev.birdybanners.co.uk/ site as it has the same problem of the dev site being indexed in the SERPs.
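If you go with option 1, it can help to keep a per-page checklist of what is still outstanding. The following is only a sketch of that bookkeeping: the `DevPage` record and `next_step` function are hypothetical, and the field values would have to come from your own crawl of the dev site.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DevPage:
    """Observed state of one dev URL (hypothetical record for this sketch)."""
    url: str
    status: int                   # HTTP status code returned for the dev URL
    location: Optional[str]       # Location header of a redirect, if any
    has_noindex: bool             # page carries a meta robots no-index tag
    blocked_by_robots_txt: bool   # robots.txt still blocks crawlers from this URL


def next_step(page):
    """Return the outstanding action for a dev page under option 1."""
    if page.blocked_by_robots_txt:
        # Nothing else matters until crawlers are allowed to request the page.
        return "unblock in robots.txt"
    if page.status == 301 and page.location:
        return "done: 301 redirect is crawlable"
    if page.has_noindex:
        return "done: no-index tag is crawlable"
    return "add a 301 redirect or a meta robots no-index tag"


home = DevPage(
    url="http://dev.rollerbannerscheap.co.uk/",
    status=301,
    location="http://rollerbannerscheap.co.uk/",
    has_noindex=False,
    blocked_by_robots_txt=True,
)
print(next_step(home))  # unblock in robots.txt
```

The ordering of the checks mirrors the reasoning above: a redirect or tag only counts once the robots.txt block no longer hides it from the crawler.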
Hope that helps give you a strategy to resolve the problem? Be sure to holler if you need me to better clarify anything.
Paul
-
Hi Andy,
Thanks for your response.
When I visit remove URLs, I enter dev.rollerbannerscheap.co.uk but then it displays the URL as http://www.rollerbannerscheap.co.uk/dev.rollerbannerscheap.co.uk.
I want to remove a subdomain, not a page. Are you able to assist?
-
In GWT, ensure you have removed the directory / subdomain from the listings / index (under Optimisation > Remove URLs).
It may take a week to kick in, but if your 301s are working and robots is in place it will work.
In addition to these, ensure you are using canonical tags pointing to the live location, not dev.
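As a small illustration of the canonical tag point, here is a Python sketch that turns a dev URL into the tag a page would need. The `canonical_tag` helper is made up for this example, and the live host is an assumption taken from the URL in the original question; swap in the `www.` form if that is the version you want indexed.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

# Assumption: the live host as written in the original question.
LIVE_HOST = "rollerbannerscheap.co.uk"


def canonical_tag(page_url, live_host=LIVE_HOST):
    """Build a canonical link tag pointing at the live-site version of a URL."""
    parts = urlsplit(page_url)
    # Keep scheme, path and query; swap the host; drop any fragment.
    live_url = urlunsplit((parts.scheme, live_host, parts.path or "/", parts.query, ""))
    return '<link rel="canonical" href="{}" />'.format(escape(live_url, quote=True))


print(canonical_tag("http://dev.rollerbannerscheap.co.uk/blog/"))
# <link rel="canonical" href="http://rollerbannerscheap.co.uk/blog/" />
```

The tag belongs in the `<head>` of each page, so that both the dev and live copies of a page name the live URL as the preferred one.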