Duplicate pages in Google index despite canonical tag and URL Parameter in GWMT
-
Good morning Moz...
This is a weird one. It seems to be a "bug" with Google, honest...
We migrated our site www.three-clearance.co.uk to a Drupal platform over the new year. The old site used URL-based tracking for heat map purposes, so for instance
www.three-clearance.co.uk/apple-phones.html
..could be reached via
www.three-clearance.co.uk/apple-phones.html?ref=menu or
www.three-clearance.co.uk/apple-phones.html?ref=sidebar and so on.
GWMT was told of the ref parameter and the canonical meta tag used to indicate our preference. As expected we encountered no duplicate content issues and everything was good.
This is the chain of events:
-
Site migrated to new platform following best practice, as far as I can attest to.
-
Only known issue was that the verification for both google analytics (meta tag) and GWMT (HTML file) didn't transfer as expected so between relaunch on the 22nd Dec and the fix on 2nd Jan we have no GA data, and presumably there was a period where GWMT became unverified.
-
URL structure and URIs were maintained 100% (which may be a problem, now)
-
Yesterday I discovered 200-ish 'duplicate meta titles' and 'duplicate meta descriptions' in GWMT. Uh oh, thought I. Expand the report out and the duplicates are in fact ?ref= versions of the same root URL. Double uh oh, thought I.
-
Run, not walk, to google and do some Fu:
http://is.gd/yJ3U24 (9 versions of the same page, in the index, the only variation being the ?ref= URI)
Checked BING and it has indexed each root URL once, as it should.
Situation now:
-
Site no longer uses ?ref= parameter, although of course there still exists some external backlinks that use it. This was intentional and happened when we migrated.
-
I 'reset' the URL parameter in GWMT yesterday, given that there's no "delete" option. The "URLs monitored" count went from 900 to 0, but today is at over 1,000 (another wtf moment)
I also resubmitted the XML sitemap and fetched 5 'hub' pages as Google, including the homepage and HTML site-map page.
- The ?ref= URls in the index have the disadvantage of actually working, given that we transferred the URL structure and of course the webserver just ignores the nonsense arguments and serves the page. So I assume Google assumes the pages still exist, and won't drop them from the index but will instead apply a dupe content penalty. Or maybe call us a spam farm. Who knows.
Options that occurred to me (other than maybe making our canonical tags bold or locating a Google bug submission form ) include
A) robots.txt-ing .?ref=. but to me this says "you can't see these pages", not "these pages don't exist", so isn't correct
B) Hand-removing the URLs from the index through a page removal request per indexed URL
C) Apply 301 to each indexed URL (hello BING dirty sitemap penalty)
D) Post on SEOMoz because I genuinely can't understand this.
Even if the gap in verification caused GWMT to forget that we had set ?ref= as a URL parameter, the parameter was no longer in use because the verification only went missing when we relaunched the site without this tracking. Google is seemingly 100% ignoring our canonical tags as well as the GWMT URL setting - I have no idea why and can't think of the best way to correct the situation.
Do you?
Edited To Add: As of this morning the "edit/reset" buttons have disappeared from GWMT URL Parameters page, along with the option to add a new one. There's no messages explaining why and of course the Google help page doesn't mention disappearing buttons (it doesn't even explain what 'reset' does, or why there's no 'remove' option).
-
-
GWT numbers sometimes ignore parameter handling, oddly, and can be hard to read. I'm only seeing about 40 indexed pages with "ref" in the URL, which hardly seems disastrous. One note - once the pages get indexed, for whatever reason, de-indexing can take weeks, even if you do everything correctly. Don't change tactics every couple of days, or you're only going to make this worse, long-term. I think canonicals are fine for this, and they should be effective. It just may take Google some time to re-crawl and dis-lodge the pages. You actually may want to create an XML sitemap (for Google only) that just contains the "ref=" pages Google has indexed. This can nudge them to re-crawl and honor the canonical. Otherwise, the pages could sit there forever. You could 301-redirect - it would be perfectly valid in this case, since those URLs have no value to visitors. I wouldn't worry about the Bing sitemaps - just don't include the "ref=" URLs in the Bing maps, and you'll be fine.
-
Monday morning, still the same, still no reset/add parameters buttons in GMWT any more, still not understanding why Google is being so stubborn about this.
3 identical pages in the index, Google ignoring both GWMT URL parameter and canonical meta tag.
Sigh.
-
Nope, nice clean site map that GWMT says provides the right number of URLs with no 404s and no ?ref= links.
It's like Google has always indexed these links separately but for some reason has decided to only show them now they no longer exist..
-
They arent in your xml sitemap are they? You probably generated a new one when you moved the site over... that could possibly be overriding the parameters... maybe... weird...
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Canonical tag not working
I have a weebly site and I put the canonical tag in the header code but the moz crawler still says that I'm missing the canonical tag. Any tips?
Technical SEO | | ctpolarbears0 -
Over 500 thin URLs indexed from dynamically created pages (for lightboxes)
I have a client who has a resources section. This section is primarily devoted to definitions of terms in the industry. These definitions appear in colored boxes that, when you click on them, turn into a lightbox with their own unique URL. Example URL: /resources/?resource=dlna The information for these lightboxes is pulled from a standard page: /resources/dlna. Both are indexed, resulting in over 500 indexed pages that are either a simple lightbox or a full page with very minimal content. My question is this: Should they be de-indexed? Another option I'm knocking around is working with the client to create Skyscraper pages, but this is obviously a massive undertaking given how many they have. Would appreciate your thoughts. Thanks.
Technical SEO | | Alces0 -
Google not indexing my website
Hi guys, We have this website http://www.m-health-expo.nl/ but it is not indexed by google. In webmaster tools google says that it can not fetch the site due to the robots.txt but i do not see any faults in it. http://www.m-health-expo.nl/robots.txt Do you see something strange, it really bothers me.
Technical SEO | | RuudHeijnen0 -
Page URL Change
We're planning on rolling out a redesign of an existing page, and at the same time, we're looking to possibly changing the URL of the page. Currently, the URL is www.blah.com/phraseword1-phraseword2-phraseword3-phraseword4 and we're ranking top 3 in Google SERP for that 4-word phrase. The keyword phrase is something we have in our Page Title, Site Copy and the URL. Now, we are planning on simplifying the URL to below.. www.blah.com/phraseword1-phraseword2 The plan is to 301 redirect the original URL to this new URL and actually work the exact phrase into the copy a few more times. My understanding is that URL doesn't get as much weight as it does in the past, but it's still important. So my question is... How important is the URL in this case where we will continue to have it in our page title and also we'll be working more copy on to the page with the appropriate keyword? Will 301 redirect from the old URL address the issue of passing SEO value for that keyword phrase? Thanks,
Technical SEO | | JoeLin
Joe0 -
Duplicate page content
Hello, The pro dashboard crawler bot thing that you get here reports the mydomain.com and mydomain.com/index.htm as duplicate pages. Is this a problem? If so how do I fix it? Thanks Ian
Technical SEO | | jwdl0 -
Why is my office page not being indexed?
Good Morning from 24 degrees C partly cloudy wetherby UK 🙂 This page is not being indexed by Google:
Technical SEO | | Nightwing
http://www.sandersonweatherall.co.uk/office-to-let-leeds/ 1st Question Ive checked robots txt file no problems, i'm in the midst of updating the xml sitemap (it had the old one in place). It only has one link from this page http://www.sandersonweatherall.co.uk/Site-Map/ So is the reason oits not being indexed just a simple case of lack if SEO juice from inbound links so the remedy lies in routing more inbound links to the offending page? 2nd question Is the quickest way to diagnose if a web address is not being indexed to cut and paste the url in the Google search box and if it doesnt return the page theres a problem? Thanks in advance, David0 -
Page not being indexed
Hi all, On our site we have a lot of bookmaker reviews, and we are ranking pretty good for most bookmaker names as keywords, however a single bookmaker seems to have been shunned by Google. For a search "betsafe" in Denmark, this page does not appear among the top 50: http://www.betxpert.com/bookmakere/betsafe All of our other review pages rank in top 10-20 for the bookmaker name as keyword. What to do if Google has "banned" a page? Best regards, Rasmus
Technical SEO | | rasmusbang0 -
High number of Duplicate Page titles and Content related to index.php
It appears that every page on our site (www.bridgewinners.com) also creates a version of itself with a suffix. This results in Seomoz indicating that there are thousands of duplicate titles and content. 1. Does this matter? If so, how much? 2. How do I eliminate this (we are using joomla)? Thanks.
Technical SEO | | jfeld2220