Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Removing duplicate content
-
Due to URL changes and parameters on our ecommerce sites, we have a massive amount of duplicate pages indexed by google, sometimes up to 5 duplicate pages with different URLs.
1. We've instituted canonical tags site wide.
2. We are using the parameters function in Webmaster Tools.
3. We are using 301 redirects on all of the obsolete URLs
4. I have had many of the pages fetched so that Google can see and index the 301s and canonicals.
5. I created HTML sitemaps with the duplicate URLs, and had Google fetch and index the sitemap so that the dupes would get crawled and deindexed.
None of these seems to be terribly effective. Google is indexing pages with parameters in spite of the parameter (clicksource) being called out in GWT. Pages with obsolete URLs are indexed in spite of them having 301 redirects. Google also appears to be ignoring many of our canonical tags as well, despite the pages being identical.
Any ideas on how to clean up the mess?
-
Where this is appearing the most is on cross domain canonicals. We have duplicate content across 2 websites, and we've canonicaled some pages from Site A to Site B, and some from Site B to Site A. In theory, pages that were canonicaled to the other domain should be deindexed. When I run a rankings report, I see pages for the wrong domain ranking, a month later. They are pages with parameters, or old URLs that we've changed. It's like a game of whack a mole. Every time we get a page deindexed, a duplicate with a different parameter takes its place. And this is in spite of calling out these parameters in GWT.
What I imagine is happening is that we have several URLs for the same page indexed. When Google crawls our site, it is correctly canonicaling the page it crawls. In the rankings, however, Google is probably pulling a duplicate page out of its index, and ranking it without crawling it. If it was crawling it, Google would see the canonical tag, and not rank it. So we have an ongoing battle to get Google to crawl the page it just pulled out of its index to see the the canonical tag.
The reason for all this is that when a page cross domain canonicals correctly, the rankings for the duplicate page on the other site goes up dramatically. As long as Google keeps ranking the wrong pages, we don't get the rankings bump on the other site.
-
Are you basing this on a site: search? It's fairly common for URLs to appear in a site: search that otherwise will not appear for any actual searches. Are the undesirable versions of the URLs getting any search traffic?
-
Yes, as Patrick said, surprisingly often something like this is a result of a simple oversight because we have been looking at the same code over and over...
Do you have access to Screaming Frog? You could crawl your site and see whether redirects/canonicals are behaving as you expected.
Have you taken a look at the html of one of the incorrectly indexed pages when it is loaded in your browser? Can you see the canonical? If you try going to a redirected page, does it redirect? [I know--way to obvious, but sometimes it is good to start at the beginning again when we can't root out an issue.]
Another culprit in these cases can be internal links. Do you link internally using any of the undesirable URLs? That can send a message to Google that those URLs are still in play. Again, you can use Screaming Frog to find those strings.
-
It sounds like part of the problem may be the sitemaps you're sending. By including duplicates in a sitemap, you're basically telling Google that each version of the page is valid. I would remove them and resubmit a sitemap with only the canonical versions you want indexed and see if that helps.
-
Hi there
Are you sure you are using all of the tools above properly? Not saying you're not but people make mistakes and it's just something to look into.
When did you implement all of the changes? Was it recently or was it a long time ago?
How is your organic traffic and rankings? Did you check if you have a manual action at all?
Let me know - thanks!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Duplicate Content through 'Gclid'
Hello, We've had the known problem of duplicate content through the gclid parameter caused by Google Adwords. As per Google's recommendation - we added the canonical tag to every page on our site so when the bot came to each page they would go 'Ah-ha, this is the original page'. We also added the paramter to the URL parameters in Google Wemaster Tools. However, now it seems as though a canonical is automatically been given to these newly created gclid pages; below https://www.google.com.au/search?espv=2&q=site%3Awww.mypetwarehouse.com.au+inurl%3Agclid&oq=site%3A&gs_l=serp.3.0.35i39l2j0i67l4j0i10j0i67j0j0i131.58677.61871.0.63823.11.8.3.0.0.0.208.930.0j3j2.5.0....0...1c.1.64.serp..8.3.419.nUJod6dYZmI Therefore these new pages are now being indexed, causing duplicate content. Does anyone have any idea about what to do in this situation? Thanks, Stephen.
Intermediate & Advanced SEO | | MyPetWarehouse0 -
Avoiding Duplicate Content with Used Car Listings Database: Robots.txt vs Noindex vs Hash URLs (Help!)
Hi Guys, We have developed a plugin that allows us to display used vehicle listings from a centralized, third-party database. The functionality works similar to autotrader.com or cargurus.com, and there are two primary components: 1. Vehicle Listings Pages: this is the page where the user can use various filters to narrow the vehicle listings to find the vehicle they want.
Intermediate & Advanced SEO | | browndoginteractive
2. Vehicle Details Pages: this is the page where the user actually views the details about said vehicle. It is served up via Ajax, in a dialog box on the Vehicle Listings Pages. Example functionality: http://screencast.com/t/kArKm4tBo The Vehicle Listings pages (#1), we do want indexed and to rank. These pages have additional content besides the vehicle listings themselves, and those results are randomized or sliced/diced in different and unique ways. They're also updated twice per day. We do not want to index #2, the Vehicle Details pages, as these pages appear and disappear all of the time, based on dealer inventory, and don't have much value in the SERPs. Additionally, other sites such as autotrader.com, Yahoo Autos, and others draw from this same database, so we're worried about duplicate content. For instance, entering a snippet of dealer-provided content for one specific listing that Google indexed yielded 8,200+ results: Example Google query. We did not originally think that Google would even be able to index these pages, as they are served up via Ajax. However, it seems we were wrong, as Google has already begun indexing them. Not only is duplicate content an issue, but these pages are not meant for visitors to navigate to directly! If a user were to navigate to the url directly, from the SERPs, they would see a page that isn't styled right. Now we have to determine the right solution to keep these pages out of the index: robots.txt, noindex meta tags, or hash (#) internal links. Robots.txt Advantages: Super easy to implement Conserves crawl budget for large sites Ensures crawler doesn't get stuck. After all, if our website only has 500 pages that we really want indexed and ranked, and vehicle details pages constitute another 1,000,000,000 pages, it doesn't seem to make sense to make Googlebot crawl all of those pages. Robots.txt Disadvantages: Doesn't prevent pages from being indexed, as we've seen, probably because there are internal links to these pages. We could nofollow these internal links, thereby minimizing indexation, but this would lead to each 10-25 noindex internal links on each Vehicle Listings page (will Google think we're pagerank sculpting?) Noindex Advantages: Does prevent vehicle details pages from being indexed Allows ALL pages to be crawled (advantage?) Noindex Disadvantages: Difficult to implement (vehicle details pages are served using ajax, so they have no tag. Solution would have to involve X-Robots-Tag HTTP header and Apache, sending a noindex tag based on querystring variables, similar to this stackoverflow solution. This means the plugin functionality is no longer self-contained, and some hosts may not allow these types of Apache rewrites (as I understand it) Forces (or rather allows) Googlebot to crawl hundreds of thousands of noindex pages. I say "force" because of the crawl budget required. Crawler could get stuck/lost in so many pages, and my not like crawling a site with 1,000,000,000 pages, 99.9% of which are noindexed. Cannot be used in conjunction with robots.txt. After all, crawler never reads noindex meta tag if blocked by robots.txt Hash (#) URL Advantages: By using for links on Vehicle Listing pages to Vehicle Details pages (such as "Contact Seller" buttons), coupled with Javascript, crawler won't be able to follow/crawl these links. Best of both worlds: crawl budget isn't overtaxed by thousands of noindex pages, and internal links used to index robots.txt-disallowed pages are gone. Accomplishes same thing as "nofollowing" these links, but without looking like pagerank sculpting (?) Does not require complex Apache stuff Hash (#) URL Disdvantages: Is Google suspicious of sites with (some) internal links structured like this, since they can't crawl/follow them? Initially, we implemented robots.txt--the "sledgehammer solution." We figured that we'd have a happier crawler this way, as it wouldn't have to crawl zillions of partially duplicate vehicle details pages, and we wanted it to be like these pages didn't even exist. However, Google seems to be indexing many of these pages anyway, probably based on internal links pointing to them. We could nofollow the links pointing to these pages, but we don't want it to look like we're pagerank sculpting or something like that. If we implement noindex on these pages (and doing so is a difficult task itself), then we will be certain these pages aren't indexed. However, to do so we will have to remove the robots.txt disallowal, in order to let the crawler read the noindex tag on these pages. Intuitively, it doesn't make sense to me to make googlebot crawl zillions of vehicle details pages, all of which are noindexed, and it could easily get stuck/lost/etc. It seems like a waste of resources, and in some shadowy way bad for SEO. My developers are pushing for the third solution: using the hash URLs. This works on all hosts and keeps all functionality in the plugin self-contained (unlike noindex), and conserves crawl budget while keeping vehicle details page out of the index (unlike robots.txt). But I don't want Google to slap us 6-12 months from now because it doesn't like links like these (). Any thoughts or advice you guys have would be hugely appreciated, as I've been going in circles, circles, circles on this for a couple of days now. Also, I can provide a test site URL if you'd like to see the functionality in action.0 -
Problems with ecommerce filters causing duplicate content.
We have an ecommerce website with 700 pages. Due to the implementation of filters, we are seeing upto 11,000 pages being indexed where the filter tag is apphended to the URL. This is causing duplicate content issues across the site. We tried adding "nofollow" to all the filters, we have also tried adding canonical tags, which it seems are being ignored. So how can we fix this? We are now toying with 2 other ideas to fix this issue; adding "no index" to all filtered pages making the filters uncrawble using javascript Has anyone else encountered this issue? If so what did you do to combat this and was it successful?
Intermediate & Advanced SEO | | Silkstream0 -
Best practice for duplicate website content: same root domain name but different extension
Hi there I have a new client who has two websites: http://www.bayofislandsteambuilding.co.nz
Intermediate & Advanced SEO | | turnbullholdingsltd
http://www.bayofislandsteambuilding.org.nz They are the same in every regard apart from the domain extension (.co.nz & .org.nz) which is likely to be causing them issues with Google ranking given the huge amount of duplicate content. What is the best practice approach to fixing this? Normally, if I was starting from scratch, I would set one of the extensions as an alias which redirects to the main domain. Thanks in advance. Laurie0 -
How do I geo-target continents & avoid duplicate content?
Hi everyone, We have a website which will have content tailored for a few locations: USA: www.site.com
Intermediate & Advanced SEO | | AxialDev
Europe EN: www.site.com/eu
Canada FR: www.site.com/fr-ca Link hreflang and the GWT option are designed for countries. I expect a fair amount of duplicate content; the only differences will be in product selection and prices. What are my options to tell Google that it should serve www.site.com/eu in Europe instead of www.site.com? We are not targeting a particular country on that continent. Thanks!0 -
PDF for link building - avoiding duplicate content
Hello, We've got an article that we're turning into a PDF. Both the article and the PDF will be on our site. This PDF is a good, thorough piece of content on how to choose a product. We're going to strip out all of the links to our in the article and create this PDF so that it will be good for people to reference and even print. Then we're going to do link building through outreach since people will find the article and PDF useful. My question is, how do I use rel="canonical" to make sure that the article and PDF aren't duplicate content? Thanks.
Intermediate & Advanced SEO | | BobGW0 -
How to Remove Joomla Canonical and Duplicate Page Content
I've attempted to follow advice from the Q&A section. Currently on the site www.cherrycreekspine.com, I've edited the .htaccess file to help with 301s - all pages redirect to www.cherrycreekspine.com. Secondly, I'd added the canonical statement in the header of the web pages. I have cut the Duplicate Page Content in half ... now I have a remaining 40 pages to fix up. This is my practice site to try and understand what SEOmoz can do for me. I've looked at some of your videos on Youtube ... I feel like I'm scrambling around to the Q&A and the internet to understand this product. I'm reading the beginners guide.... any other resources would be helpful.
Intermediate & Advanced SEO | | deskstudio0 -
Could you use a robots.txt file to disalow a duplicate content page from being crawled?
A website has duplicate content pages to make it easier for users to find the information from a couple spots in the site navigation. Site owner would like to keep it this way without hurting SEO. I've thought of using the robots.txt file to disallow search engines from crawling one of the pages. Would you think this is a workable/acceptable solution?
Intermediate & Advanced SEO | | gregelwell0