Does Google see this as duplicate content?
-
I'm working on a site that has too many pages in Google's index as shown in a simple count via a site search (example):
site:http://www.mozquestionexample.com
I ended up getting a full list of these pages, and it shows pages that have supposedly been excluded from the index via GWT URL parameters and/or canonicalization.
For instance, the list of indexed pages shows:
1. http://www.mozquestionexample.com/cool-stuff
2. http://www.mozquestionexample.com/cool-stuff?page=2
3. http://www.mozquestionexample.com?page=3
4. http://www.mozquestionexample.com?mq_source=q-and-a
5. http://www.mozquestionexample.com?type=productss&sort=1date
Example #1 above is the one true page for search and the one that all the canonicals reference.
Examples #2 and #3 shouldn't be in the index because the canonical points to URL #1.
Example #4 shouldn't be in the index because it's just a tracking source code that, again, doesn't change the page, and the canonical points to #1.
Example #5 shouldn't be in the index because it's excluded in parameters as not affecting page content and the canonical is in place.
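For reference, each of the parameter variants carries a canonical tag along these lines (URLs taken from the examples above):

```html
<!-- In the <head> of every variant, e.g. /cool-stuff?page=2 or ?mq_source=q-and-a -->
<link rel="canonical" href="http://www.mozquestionexample.com/cool-stuff" />
```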
Should I worry about these multiple URLs for the same page, and if so, what should I do about it?
Thanks... Darcy
-
Darcy,
Blocking URLs in the robots.txt file will not remove them from the index if Google has already found them, nor will it prevent them from being added if Google finds links to them, such as internal navigation links or external backlinks. If this is your issue, you'll probably see something like this in the SERPs for those pages:
"We cannot display the content because our crawlers are being blocked by this site's robots.txt file" or something like that.
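For context, that message appears when a robots.txt pattern blocks crawling of those URLs. A hypothetical example (not a recommendation) that would block the tracking-parameter variant:

```
User-agent: *
# Blocking a parameter pattern hides the page content from Googlebot,
# which also hides the canonical tag on those pages
Disallow: /*?mq_source=
```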
Here's a good discussion about it on WMW.
If you have parameters set up in GWT and a rel=canonical tag that points Google to the non-parameter version of the URL, you probably don't need to block Googlebot. I would only block them if I thought crawl budget was an issue, as in seeing Google continue to crawl these pages in your log files, or when you potentially have millions of these types of pages.
-
Hi Ray,
Thanks for the response. To answer your question, the URL parameters have been set for months, if not years.
I wouldn't know how to set a noindex on a URL with a different source code, because it really isn't a whole new page, just the same page with different tracking. I'd be setting a noindex on the Example #1 page, and that would not be good.
So, should I just not worry about it then?
Thanks... Darcy
-
Hi 94501,
"Example #1 above is the one true page for search and the one that all the canonicals reference."
If the pages are properly canonicalized, then Example #1 will receive nearly all of the authority from the pages that list it as their canonical. That is, Examples #2 and #3 will pass their authority to Example #1.
"Examples #2 and #3 shouldn't be in the index because the canonical points to URL #1."
Setting a canonical tag doesn't guarantee that a page will not be indexed. To do that, you'd need to add a 'noindex' tag to the page.
Google chooses whether or not to index these pages, and in many situations you want them indexed. For example: a user searches for 'product X' and product X resides on the 3rd page of your category. Since Google has this page indexed (although the canonical points to the main page), it makes sense to show the page that contains the product the user was searching for.
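For reference, a 'noindex' tag is just a meta tag in the page's head; a sketch of what it would look like on a variant you wanted dropped from the index:

```html
<!-- Only on the variant URL you want removed from the index -->
<!-- "follow" still lets link equity flow through the page -->
<meta name="robots" content="noindex, follow">
```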
"Example #4 shouldn't be in the index, because it's just a source code that, again doesn't change the page and the canonical points to #1."
To make sure it is not indexed, you would need to add a 'noindex' tag and/or make sure the parameters are set in GWMT to ignore these pages.
But again, if the canonical is set properly, then the authority passes to the main page, and having this page indexed may not have a negative impact.
"Example #5 shouldn't be in the index because it's excluded in parameters as not affecting page content and the canonical is in place."
How long ago was the parameter setting applied in GWMT? Sometimes it takes a couple weeks to deindex pages that were already indexed by Google.
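Conceptually, the GWMT parameter settings tell Google which query-string keys to ignore so that the variants collapse onto one URL. A minimal Python sketch of that normalization (the parameter list is hypothetical, taken from the examples in the question):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of parameters declared as "doesn't affect page content"
# in GWMT, based on the example URLs above.
IGNORED_PARAMS = {"page", "mq_source", "type", "sort"}

def canonical_url(url):
    """Strip ignorable parameters, mimicking how the parameter settings
    and canonical tag collapse these URL variants onto one page."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query) if k not in IGNORED_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

urls = [
    "http://www.mozquestionexample.com/cool-stuff",
    "http://www.mozquestionexample.com/cool-stuff?page=2",
    "http://www.mozquestionexample.com/cool-stuff?mq_source=q-and-a",
]
# All three variants collapse to the same canonical target.
print({canonical_url(u) for u in urls})
```

This is only an illustration of the logic, not something Google exposes; the real deindexing happens on Google's side once the settings and tags have been recrawled.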
Related Questions
-
How to handle duplicate content with Bible verses
I have a friend who runs a site with Bible verses and different people's thoughts or feelings on them. Since I'm an SEO, he came to me with questions, and a duplicate content red flag popped up in my head. My clients all generate their own content, so I'm not familiar with this world. Since Bible verses appear all over the place, is there a way to address this from an SEO standpoint to avoid duplicate content issues? Thanks in advance.
Intermediate & Advanced SEO | jeremyskillings -
Parameter Strings & Duplicate Page Content
I'm managing a site that has thousands of pages due to all of the dynamic parameter strings that are being generated. It's a real estate listing site that allows people to create a listing, and it generates lots of new listings every day. The Moz crawl report is continually flagging A LOT (25k+) of the site's pages for duplicate content due to all of these parameter string URLs. Example: sitename.com/listings & sitename.com/listings/?addr=street name Do I really need to do anything about those pages? I have researched the topic quite a bit, but can't seem to find anything too concrete as to what the best course of action is. My original thinking was to add the rel=canonical tag to each of the main URLs that have parameters attached. I have also read that you can bypass that by telling Google what parameters to ignore in Webmaster Tools. We want these listings to show up in search results, though, so I don't know if either of these options is ideal, since each would cause the listing pages (pages with parameter strings) to stop being indexed, right? Which is why I'm wondering if doing nothing at all will hurt the site? I should also mention that I originally recommended the rel=canonical option to the web developer, who has pushed back, saying that "search engines ignore parameter strings." Naturally, he doesn't want the extra workload of setting up the canonical tags, which I can understand, but I want to make sure I'm giving him both the most feasible option for implementation and the best option to fix the issues.
Intermediate & Advanced SEO | garrettkite -
Duplicate Content Question
Currently, we manage a site that generates content from a database based on user search criteria such as location or type of business. Although we currently rank well -- we created the website based on providing value to the visitor, with options for viewing the content -- we are concerned about duplicate content issues and whether they would apply. For example, the listing that is pulled up for the user in one search could have the same content as another search, but in a different order. Similar to hotels that offer room booking by room type or by rate. Would this dynamically generated content count as duplicate content? The site has done well, but we don't want to risk any future Google penalties caused by duplicate content. Thanks for your help!
Intermediate & Advanced SEO | CompucastWeb -
International SEO - cannibalisation and duplicate content
Hello all, I look after (in house) 3 domains for one niche travel business across three TLDs: .com, .com.au and .co.uk, and a fourth domain on a .co.nz TLD which was recently removed from Google's index. Symptoms: For the past 12 months we have been experiencing cannibalisation in the SERPs (namely .com.au being rendered in .com) and Panda-related ranking devaluations between our .com site and .com.au site. Around 12 months ago the .com TLD was hit hard (80% drop in target KWs) by Panda (probably) and we began to action the below changes. Around 6 weeks ago our .com TLD saw big overnight increases in rankings (to date a 70% averaged increase). However, almost to the same percentage we saw gained on the .com TLD, we suffered significant drops in our .com.au rankings. Basically Google seemed to switch its attention from the .com TLD to the .com.au TLD. Note: Each TLD is over 6 years old, we've never proactively gone after links (Penguin) and have always aimed for quality in an often spammy industry. **Have done:** Adding hreflang markup to all pages on all domains. Each TLD uses the local vernacular, e.g. for the .com site it is American. Each TLD has pricing in the regional currency. Each TLD has details of the respective local offices, the copy references the location, and we have significant press coverage in each country, like The Guardian for our .co.uk site and the Sydney Morning Herald for our Australian site. Targeting each site to its respective market in WMT. Each TLD's core pages (within 3 clicks of the primary nav) are 100% unique. We're continuing to re-write and publish unique content to each TLD on a weekly basis. As the .co.nz site drove so little traffic, instead of re-writing we added noindex and the TLD has almost completely disappeared (16% of pages remain) from the SERPs. XML sitemaps. Google+ profile for each TLD. **Have not done:** Hosted each TLD on a local server. Around 600 pages per TLD are duplicated across all TLDs (roughly 50% of all content). These are way down the IA but still duplicated.
Images/video sourced from local servers. Added address and contact details using schema markup. Any help, advice or just validation on this subject would be appreciated! Kian
Intermediate & Advanced SEO | team_tic -
Duplicate Content and Titles
Hi Mozzers, I saw a considerable amount of duplicate content and duplicate page titles on our client's website. We are just implementing a fix in the CMS to make sure that these are all fixed. What changes do you think I could see in terms of rankings?
Intermediate & Advanced SEO | KarlBantleman -
Wordpress Duplicate Content
We have recently moved our company's blog to WordPress on a subdomain (we utilize the Yoast SEO plugin). We are now experiencing an ever-growing volume of crawl errors (nearly 300 4xx now) for pages that do not exist to begin with. I believe it may have something to do with having the blog on a subdomain and/or our Yoast SEO plugin's indexation archives (author, category, etc.) -- we currently have subpages of archives and taxonomies, and category archives in use. I'm not as familiar with WordPress and the Yoast SEO plugin as I am with other CMSes, so any help in this matter would be greatly appreciated. I can PM further info if necessary. Thank you for the help in advance.
Intermediate & Advanced SEO | BethA -
Duplicate content
Is there manual intervention required for a site that has been flagged for duplicate content to get back to its original rankings, once the duplicated content has been removed? Background: Our site recently experienced a significant drop in traffic around the time that a chunk of content from other sites (ie. duplicate) went live. While it was not an exact replica of the pages on other sites, there was quite a bit of overlap. That content has since been removed, but our traffic hasn't improved. What else can we do to improve our ranking?
Intermediate & Advanced SEO | jamesti -
Managing Large Regulated or Required Duplicate Content Blocks
We work with a number of pharmaceutical sites that under FDA regulation must include an "Important Safety Information" (ISI) content block on each page of the site. In many cases this duplicate content is not only provided on a specific ISI page, it is quite often longer than what would be considered the primary content of the page. At first blush a rel=canonical tag might appear to be a solution to signal search engines that there is a specific page for the ISI content and avoid being penalized, but the pages also contain original content that should be indexed, as it has user benefit beyond the information contained within the ISI. Anyone else running into this challenge with regulated duplicate boilerplate who has developed a workaround for handling duplicate content at the paragraph level and not the page level? One clever suggestion was to treat it as a graphic; however, for a pharma site this would be a huge graphic.
Intermediate & Advanced SEO | BlooFusion38