Robots.txt & Duplicate Content
-
In reviewing my crawl results I have 5666 pages of duplicate content. I believe this is because many of the indexed pages are just different ways to get to the same content. There is one primary culprit: a series of URLs related to CatalogSearch - for example: http://www.careerbags.com/catalogsearch/result/index/?q=Mobile
I have 10074 of those links indexed according to my Moz crawl. Of those, 5349 are tagged as duplicate content; another 4725 are not.
Here are some additional sample links:
http://www.careerbags.com/catalogsearch/result/index/?dir=desc&order=relevance&p=2&q=Amy
http://www.careerbags.com/catalogsearch/result/index/?color=28&q=bellemonde
http://www.careerbags.com/catalogsearch/result/index/?cat=9&color=241&dir=asc&order=relevance&q=baggallini
All of these links are just different ways of searching through our product catalog. My question is: should we disallow /catalogsearch/ via the robots.txt file? Are these links doing more harm than good?
-
For product pages, I would canonical the page with the most descriptive URL.
For category pages, I agree with you - I would noindex them.
I think I just answered my own question!!
-
Okay, the question concerning rel="canonical" is: which URL becomes the canonical version? Since there is no page on the website which would be appropriate (as far as I've seen), I recommended the meta robots tag.
I do agree that rel="canonical" is the preferred option, but in this situation I can't see a way to implement it properly. Which page would you highlight as the canonical?
-
I agree entirely that "Search result pages are too varied to be included in the index".
That said, my understanding is that if you canonical a page, it doesn't get indexed. So we wouldn't have to worry about the appearance / user-friendliness of the URL. But (again, in my opinion) we should still worry about link equity being passed, and that won't happen if you noindex.
This gets complicated fast. I like your solution because it's a lot cleaner and easier to implement. Still not convinced it's the "best" way to go, though.
-
Where is the evidence that these work? I have never seen them work. Google totally ignores the URL parameter settings in Google Webmaster Tools.
-
I do agree that rel="canonical" is a good option for the problem at hand.
As Jeremy has stated, however, the URL we are referring to in the href attribute redirects to the home page: http://www.careerbags.com/catalogsearch/result/index/
In my original answer I did not test this. I assumed there would be a list of all products there, not filtered by search results. Since this is not the case and that page in fact does not exist, it's hard to point at a URL to use as the canonical.
Therefore I changed my answer to include the robots meta tag. This would indeed remove the search pages from the search index, but I think that is a positive thing.
Look at the following URL: http://www.careerbags.com/catalogsearch/result/?q=rolling+laptop+bags
Not really the type of URL I would click on in the search results. The following URL, however, is something I would want to click on: http://www.careerbags.com/laptop-bags/women-s/rolling-laptop-bags.html
Search result pages are too varied to be included in the index, in my opinion.
Hope you agree with this; if not, I would like to hear your thoughts.
-
Simon, Wesley, Michael...
These customer-facing search result pages are the ones often bookmarked and shared by site visitors. How worried does one need to be about losing link equity? I realize every site is going to be different and social shares don't carry link equity - at least for now - but this could add up over time. The rel=canonical will enable capture of that link equity, whereas the robots noindex will not.
Am I overthinking this?
-
In this case you could add the meta robots tag on the search result pages like this:
content="noindex, follow">
Search results can indeed spawn an infinite amount of different URL's. This can be avoided by making sure they are not included in the index but are followed.
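For context, here is a minimal sketch of how that tag could sit in the <head> of the search results template; the surrounding markup is purely illustrative and not taken from the Careerbags templates:

<!DOCTYPE html>
<html>
<head>
  <title>Search results</title>
  <!-- Keep this internal search results page out of the index,
       but let crawlers follow the links to the product pages it lists -->
  <meta name="robots" content="noindex, follow">
</head>
<body>
  <!-- search results listing -->
</body>
</html>

With "follow" in place, crawlers can still reach the product and category pages linked from the results, which is what keeps those primary URLs crawlable even though the search pages themselves stay out of the index.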
-
Google's Webmaster Guidelines specifically request that you prevent crawling of search results pages using a robots.txt file. The relevant section reads: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
-
There are two distinct possible issues here:
1. Search results are creating duplicate content
2. Search results are creating lots of thin content
You want to give the user every possibility of finding your products, but you don't want those search results indexed, because you should already have your source product pages indexed and aiming to rank well. If not, see the last paragraph.
I slightly misread your post and took the URLs to be purely filtered. You should add Disallow: /catalogsearch to your robots.txt, and if any of these pages are already indexed you can remove the directory in Webmaster Tools > Google Index > Remove URLs > Reason: Remove Directory. This from Google: http://www.mattcutts.com/blog/search-results-in-search-results/
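A minimal sketch of that robots.txt entry, assuming all of the internal search URLs live under /catalogsearch as in the examples above (worth confirming against the live site before deploying):

User-agent: *
# Block Magento catalog search result pages from being crawled
Disallow: /catalogsearch

Keep in mind that robots.txt blocks crawling rather than indexing, so URLs that are already in the index may also need the Remove URLs step described above before they drop out.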
If your site has any other parameters not in that directory you can add them in Webmaster Tools > Crawl > URL Parameters > Let Googlebot Decide. Google will understand they are not the main URLs and treat them accordingly.
As a side issue, it would be a good idea to analyse your internal search terms in Analytics. You might find a trend - perhaps something people search for that isn't a perfect match for the returned results - where you can create new, more targeted content.
-
I'm not sure this is the right approach. The catalog search is based on the search box on the website, and the query parameter can be anything the customer enters. Are you suggesting that the backend code be modified to always return that canonical tag on every result page?
And why that page? That URL just redirects to the home page, because there is no query parameter provided for the search.
In terms of losing link equity, how much equity do they have if they are duplicate content?
-
Hi Jeremy.
Yours is a common problem. The best way to deal with it is, as Wesley mentions, by putting canonical tags on all the duplicate pages - the one you want indexed and to show up in search results AND all the others that you can arrive at via catalog search or any other means of navigation.
Michael's suggestion will prevent the duplicate pages from getting indexed by Google. Unfortunately you lose any link equity going that route, so I'd suggest starting with canonical tags first.
-
To back up the detail Wesley gave you: you can also add URL parameters in Google Webmaster Tools.
-
You could add a canonical tag to link to the default page. This way Google will know that it should only index that.
The code for this would be: <link rel="canonical" href="http://www.careerbags.com/catalogsearch/result/index/" /> - this should be placed in the <head> section of your HTML code.
Some more resources on the subject: