Robots.txt & Duplicate Content
-
In reviewing my crawl results I have 5666 pages of duplicate content. I believe this is because many of the indexed pages are just different ways to get to the same content. There is one primary culprit. It's a series of URL's related to CatalogSearch - for example; http://www.careerbags.com/catalogsearch/result/index/?q=Mobile
I have 10074 of those links indexed according to my MOZ crawl. Of those 5349 are tagged as duplicate content. Another 4725 are not.
Here are some additional sample links:
http://www.careerbags.com/catalogsearch/result/index/?dir=desc&order=relevance&p=2&q=Amy
http://www.careerbags.com/catalogsearch/result/index/?color=28&q=bellemonde
http://www.careerbags.com/catalogsearch/result/index/?cat=9&color=241&dir=asc&order=relevance&q=baggalliniAll of these links are just different ways of searching through our product catalog. My question is should we disallow - catalogsearch via the robots file? Are these links doing more harm than good?
-
For product pages, I would canonical the page with the most descriptive URL.
For category pages, I agree with you, I would noindex them.
I think I just answered my own question!!
-
Oke, the question concerning rel="canonical" is which URL becomes the canonical version? Since there is no page on the website which would be appropiate (as far as i've seen) i recommended the meta robots tag.
I do agree that rel="canonical" is the preferred option, but in this situation i can't see a way to implement it properly. Which page would you highlight as the canonical?
-
I agree entirely that "Search result pages are too varied to be included in the index".
That said, my understanding is that if you canonical a page, it doesn't get indexed. So we wouldn't have to worry about the appearance / user-friendliness of the URL. But (again, in my opinion) we should still worry about link equity being passed, and that won't happen if you noindex.
This gets complicated fast. I like your solution b/c it's a lot cleaner and easier to implement. Still not convinced it's the "best" way to go though.
-
Where is the evidence that these work? I have never seen them work. Google totally ignores the URL parameters tools in GWTs.
-
I do agree that a rel="canonical" is good option for the problem that's at hand.
As jeremy has stated however the link we are referring to in the href section redirects to the home page. http://www.careerbags.com/catalogsearch/result/index/In my original answer i did not test this. I assumed there would be a list of all products here not filtered by search results. Since this is not the case and this page in fact does not exist it's hard to point at a url to be canonical.
Therefor i changed my answer to include the robots meta tag. This would indeed remove the search pages from the search index. I do think this is a positive thing though.
Look at the following url: http://www.careerbags.com/catalogsearch/result/?q=rolling+laptop+bags
Not really the type of URL i would click on in the search results. The following URL however is something i would want to click on: http://www.careerbags.com/laptop-bags/women-s/rolling-laptop-bags.html
Search result pages are too varied to be included in the index to my opinion.
Hope you agree with this, if not then i would like to hear your thoughts on this.
-
Simon, Wesley, Michael...
These customer facing search result pages are the ones often bookmarked and shared by site visitors. How worried does one need to be about losing link equity? I realize every site is going to be different and social shares don't have link equity - at least for now - but this could add up over time. The rel canonical will enable capture of link equity whereas the robots noindex will not.
Am I over thinking this?
-
In this case you could add the meta robots tag on the search result pages like this:
content="noindex, follow">
Search results can indeed spawn an infinite amount of different URL's. This can be avoided by making sure they are not included in the index but are followed.
-
Webmaster guidelines specifically request that you prevent crawling of search results pages using a robots.txt file. The relevant section reads: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
-
There are 2 distinct possible issues here
1. Search results are creating duplicate content
2. Search results are creating lots of thin content
You want to give the user every possibility of finding your products, but you don't want those search results indexed because you should already have your source product page indexed and aiming to rank well. If not see last paragraph.
I slightly misread your post and took the URLs to be purely filtered. You should add disallow /catalogsearch to your robots.txt and if any are indexed you can remove the directory in Webmaster Tools > Google Index > Remove URLs > Reason: Remove Directory. This from Google - http://www.mattcutts.com/blog/search-results-in-search-results/
If your site has any other parameters not in that directory you can add them in Webmaster Tools > Crawl > URL Parameters > Let Googlebot Decide. Google will understand they are not the main URLs and treat them accordingly.
As a side issue with your search results it would be a good idea to analyse them in Analytics. You might find you have a trend, maybe something searched for or not the perfect match for the returned result, where you can create new more targeted content.
-
I'm not sure this is the right approach. The catalog search is based on the search box on the website. The query parameter can be anything the customer enters. Are you suggesting that the backend code be modified to always return the in every result?
And why that page because that URL just redirects to the home page because there is no query parameter provided for the search.
In terms o losing link equity, how much equity do they have it they are duplicate content?
-
Hi Jeremy.
Yours is a common problem. The best way to deal with it is, as Wesley mentions, by putting canonical tags on all the duplicate pages - the one you want indexed and to show up in search results AND all the others that you can arrive at via catalog search or any other means of navigation.
Michael's suggestion will prevent the duplicate pages from getting indexed by Google. Unfortunately you lose any link equity going that route, so I'd suggest starting with canonical tags first.
-
To back up the detail Wesley gave you, you can also add URL parameters in Google Webmaster Tools
-
You could add a canonical tag to link to the default page. This way Google will know that it should only index that.
The code for this would be:This should be placed in the section of your HTML code.
Some more resources on the subject:
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Duplicate page content on numerical blog pages?
Hello everyone, I'm still relatively new at SEO and am still trying my best to learn. However, I have this persistent issue. My site is on WordPress and all of my blog pages e.g page one, page two etc are all coming up as duplicate content. Here are some URL examples of what I mean: http://3mil.co.uk/insights-web-design-blog/page/3/ http://3mil.co.uk/insights-web-design-blog/page/4/ Does anyone have any ideas? I have already no indexed categories and tags so it is not them. Any help would be appreciated. Thanks.
Intermediate & Advanced SEO | | 3mil0 -
Moving some content to a new domain - best practices to avoid duplicate content?
Hi We are setting up a new domain to focus on a specific product and want to use some of the content from the original domain on the new site and remove it from the original. The content is appropriate for the new domain and will be irrelevant for the original domain and we want to avoid creating completely new content. There will be a link between the two domains. What is the best practice for this to avoid duplicate content and a potential Panda penalty?
Intermediate & Advanced SEO | | Citybase0 -
Duplicate (Basically) H1 & H2
We've about to relaunch one of our ecommerce sites and have a question regarding H1 & H2 tags. We use our primary keyword for each category in that category page's H1. We also include a block of text at the bottom of the page explaining the benefits of the products, the various styles we offer, personalization options, gift packaging, etc. We were planning on having an H2 at the beginning of that text that read 'About [keyword:]', but the question of duplicate H1 & H2 tags has come up. Is penalization possible for having them almost the same? It's not like they're not relevant - the H1 is referring to the category itself and the H2 references our explanation of the category. Just curious what the best way to approach this would be.
Intermediate & Advanced SEO | | Kingof50 -
Wordpress Duplicate Content
We have recently moved our company's blog to Wordpress on a subdomain (we utilize the Yoast SEO plugin). We are now experiencing an ever-growing volume of crawl errors (nearly 300 4xx now) for pages that do not exist to begin with. I believe it may have something to do with having the blog on a subdomain and/or our yoast seo plugin's indexation archives (author, category, etc) --- we currently have Subpages of archives and taxonomies, and category archives in use. I'm not as familiar with Wordpress and the Yoast SEO plugin as I am with other CMS' so any help in this matter would be greatly appreciated. I can PM further info if necessary. Thank you for the help in advance.
Intermediate & Advanced SEO | | BethA0 -
Duplicate Content - Panda Question
Question: Will duplicate informational content at the bottom of indexed pages violate the panda update? **Total Page Ratio: ** 1/50 of total pages will have duplicate content at the bottom off the page. For example...on 20 pages in 50 different instances there would be common information on the bottom of a page. (On a total of 1000 pages). Basically I just wanted to add informational data to help clients get a broader perspective on making a decision regarding "specific and unique" information that will be at the top of the page. Content ratio per page? : What percentage of duplicate content is allowed per page before you are dinged or penalized. Thank you, Utah Tiger
Intermediate & Advanced SEO | | Boodreaux0 -
Using 2 wildcards in the robots.txt file
I have a URL string which I don't want to be indexed. it includes the characters _Q1 ni the middle of the string. So in the robots.txt can I use 2 wildcards in the string to take out all of the URLs with that in it? So something like /_Q1. Will that pickup and block every URL with those characters in the string? Also, this is not directly of the root, but in a secondary directory, so .com/.../_Q1. So do I have to format the robots.txt as //_Q1* as it will be in the second folder or just using /_Q1 will pickup everything no matter what folder it is on? Thanks.
Intermediate & Advanced SEO | | seo1234560 -
Duplicate content for area listings
Hi, I was slightly affected by the panda update on the 14th oct generaly dropping by about 5-8 spots in the serps for my main keywords, since then I've been giving my site a good looking over. On a site I've got city listings urls for certain widget companys, the thing is many areas and thus urls will have the same company listed. What would be the best way of solving this duplicate content as google may be seeing it? I was thinking of one page per company and prominenly listing the areas they operate so still hopefully get ranked for area searches. But i'd be losing the city names in the url as I've got them now for example: mywidgetsite.com/findmagicwidgets/new-york.html mywidgetsite.com/findmagicwidgets/atlanta.html Any ideas on how best to proceed? Cheers!
Intermediate & Advanced SEO | | NetGeek0 -
Duplicate Content Help
seomoz tool gives me back duplicate content on both these URL's http://www.mydomain.com/football-teams/ http://www.mydomain.com/football-teams/index.php I want to use http://www.mydomain.com/football-teams/ as this just look nice & clean. What would be best practice to fix this issue? Kind Regards Eddie
Intermediate & Advanced SEO | | Paul780