Does Google see this as duplicate content?
-
I'm working on a site that has too many pages in Google's index, as shown by a simple count via a site: search (example):
site:http://www.mozquestionexample.com
I ended up getting a full list of these pages, and it shows pages that have supposedly been excluded from the index via GWT URL parameters and/or canonicalization.
For instance, the list of indexed pages shows:
1. http://www.mozquestionexample.com/cool-stuff
2. http://www.mozquestionexample.com/cool-stuff?page=2
3. http://www.mozquestionexample.com?page=3
4. http://www.mozquestionexample.com?mq_source=q-and-a
5. http://www.mozquestionexample.com?type=productss&sort=1date
Example #1 above is the one true page for search and the one that all the canonicals reference.
Examples #2 and #3 shouldn't be in the index because the canonical points to URL #1.
Example #4 shouldn't be in the index because it's just a tracking source code that, again, doesn't change the page, and the canonical points to #1.
Example #5 shouldn't be in the index because its parameters are excluded in GWT as not affecting page content, and the canonical is in place.
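The rule these examples describe (every parameterized URL canonicalizes back to the bare URL) can be sketched server-side. This is a minimal illustration, not the site's actual code; the parameter names are taken from the examples above:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that don't change page content (per the examples above):
# pagination, tracking source, and listing type/sort.
IGNORED_PARAMS = {"page", "mq_source", "type", "sort"}

def canonical_url(url):
    """Strip content-neutral parameters to get the 'one true page' URL."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query) if k not in IGNORED_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

print(canonical_url("http://www.mozquestionexample.com/cool-stuff?page=2"))
# -> http://www.mozquestionexample.com/cool-stuff
```

Generating the canonical href from one function like this keeps every URL variant pointing at the same preferred page.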
Should I worry about these multiple URLs for the same page, and if so, what should I do about it?
Thanks... Darcy
-
Darcy,
Blocking URLs in the robots.txt file will not remove them from the index if Google has already found them, nor will it prevent them from being added if Google finds links to them (internal navigation links or external backlinks). If this is your issue, you'll probably see something like this in the SERPs for those pages:
"We cannot display the content because our crawlers are being blocked by this site's robots.txt file" or something like that.
Here's a good discussion about it on WMW.
If you have parameters set up in GWT and are using a rel canonical tag that points Google to the non-parameter version of the URL, you probably don't need to block Googlebot. I would only block them if I thought crawl budget was an issue, as in seeing Google continue to crawl these pages in your log files, or when you potentially have millions of these types of pages.
-
Hi Ray,
Thanks for the response. To answer your question, the URL parameters have been set for months, if not years.
I wouldn't know how to set a noindex on a URL with a different source code, because it really isn't a whole new URL, just different tracking. I'd be setting a noindex for the Example #1 page, and that would not be good.
So, should I just not worry about it then?
Thanks... Darcy
-
Hi 94501,
"Example #1 above is the one true page for search and the one that all the canonicals reference."
If the pages are properly canonicalized, then Example #1 will receive nearly all of the authority stemming from pages that list this URL in their canonical tag, i.e., Examples #2 and #3 will pass authority to Example #1.
"Examples #2 and #3 shouldn't be in the index because the canonical points to URL #1."
Setting a canonical tag doesn't guarantee that a page will not be indexed. To do that, you'd need to add a 'noindex' tag to the page.
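For reference, the two tags being contrasted here look like this in a page's head (URLs from the examples above). They are alternatives rather than a pair, since a page kept out of the index doesn't need a canonical hint:

```html
<!-- Hint only: suggests the preferred URL, but Google may still
     index this page under its parameterized URL -->
<link rel="canonical" href="http://www.mozquestionexample.com/cool-stuff" />

<!-- Directive: keeps the page out of the index entirely -->
<meta name="robots" content="noindex" />
```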
Google chooses whether or not to index these pages, and in many situations you want them indexed. For example: a user searches for 'product X', and product X resides on the 3rd page of your category. Since Google has this page indexed (although the canonical points to the main page), it makes sense to show the page that contains the product the user was searching for.
"Example #4 shouldn't be in the index because it's just a tracking source code that, again, doesn't change the page, and the canonical points to #1."
To make sure it is not indexed, you would need to add a 'noindex' tag and/or make sure the parameters are set in GWMT to ignore these pages.
But again, if the canonical is set properly, then the authority passes to the main page, and having this page indexed may not have a negative impact.
"Example #5 shouldn't be in the index because its parameters are excluded in GWT as not affecting page content, and the canonical is in place."
How long ago was the parameter setting applied in GWMT? Sometimes it takes a couple weeks to deindex pages that were already indexed by Google.
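On Darcy's earlier point that a noindex can't target just the tracking version of a URL: because every variant is served by the same template, the robots meta value can be chosen per-request from the query string, so Example #1 is never touched. A minimal sketch, assuming the parameter names from the examples above (not Darcy's actual stack):

```python
from urllib.parse import parse_qsl, urlsplit

# Tracking/sort parameters that should never create an indexable URL.
NOINDEX_PARAMS = {"mq_source", "type", "sort"}

def robots_meta(url):
    """'noindex' only for tracking-parameter URLs; the one true page stays indexable."""
    params = {k for k, _ in parse_qsl(urlsplit(url).query)}
    return "noindex, follow" if params & NOINDEX_PARAMS else "index, follow"

print(robots_meta("http://www.mozquestionexample.com?mq_source=q-and-a"))
# -> noindex, follow
```

The template would then emit `<meta name="robots" content="...">` with this value, leaving the canonical page's indexability unchanged.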