Robots.txt & Duplicate Content
-
In reviewing my crawl results I have 5666 pages of duplicate content. I believe this is because many of the indexed pages are just different ways to get to the same content. There is one primary culprit. It's a series of URL's related to CatalogSearch - for example; http://www.careerbags.com/catalogsearch/result/index/?q=Mobile
I have 10074 of those links indexed according to my MOZ crawl. Of those 5349 are tagged as duplicate content. Another 4725 are not.
Here are some additional sample links:
http://www.careerbags.com/catalogsearch/result/index/?dir=desc&order=relevance&p=2&q=Amy
http://www.careerbags.com/catalogsearch/result/index/?color=28&q=bellemonde
http://www.careerbags.com/catalogsearch/result/index/?cat=9&color=241&dir=asc&order=relevance&q=baggalliniAll of these links are just different ways of searching through our product catalog. My question is should we disallow - catalogsearch via the robots file? Are these links doing more harm than good?
-
For product pages, I would canonical the page with the most descriptive URL.
For category pages, I agree with you, I would noindex them.
I think I just answered my own question!!
-
Oke, the question concerning rel="canonical" is which URL becomes the canonical version? Since there is no page on the website which would be appropiate (as far as i've seen) i recommended the meta robots tag.
I do agree that rel="canonical" is the preferred option, but in this situation i can't see a way to implement it properly. Which page would you highlight as the canonical?
-
I agree entirely that "Search result pages are too varied to be included in the index".
That said, my understanding is that if you canonical a page, it doesn't get indexed. So we wouldn't have to worry about the appearance / user-friendliness of the URL. But (again, in my opinion) we should still worry about link equity being passed, and that won't happen if you noindex.
This gets complicated fast. I like your solution b/c it's a lot cleaner and easier to implement. Still not convinced it's the "best" way to go though.
-
Where is the evidence that these work? I have never seen them work. Google totally ignores the URL parameters tools in GWTs.
-
I do agree that a rel="canonical" is good option for the problem that's at hand.
As jeremy has stated however the link we are referring to in the href section redirects to the home page. http://www.careerbags.com/catalogsearch/result/index/In my original answer i did not test this. I assumed there would be a list of all products here not filtered by search results. Since this is not the case and this page in fact does not exist it's hard to point at a url to be canonical.
Therefor i changed my answer to include the robots meta tag. This would indeed remove the search pages from the search index. I do think this is a positive thing though.
Look at the following url: http://www.careerbags.com/catalogsearch/result/?q=rolling+laptop+bags
Not really the type of URL i would click on in the search results. The following URL however is something i would want to click on: http://www.careerbags.com/laptop-bags/women-s/rolling-laptop-bags.html
Search result pages are too varied to be included in the index to my opinion.
Hope you agree with this, if not then i would like to hear your thoughts on this.
-
Simon, Wesley, Michael...
These customer facing search result pages are the ones often bookmarked and shared by site visitors. How worried does one need to be about losing link equity? I realize every site is going to be different and social shares don't have link equity - at least for now - but this could add up over time. The rel canonical will enable capture of link equity whereas the robots noindex will not.
Am I over thinking this?
-
In this case you could add the meta robots tag on the search result pages like this:
content="noindex, follow">
Search results can indeed spawn an infinite amount of different URL's. This can be avoided by making sure they are not included in the index but are followed.
-
Webmaster guidelines specifically request that you prevent crawling of search results pages using a robots.txt file. The relevant section reads: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
-
There are 2 distinct possible issues here
1. Search results are creating duplicate content
2. Search results are creating lots of thin content
You want to give the user every possibility of finding your products, but you don't want those search results indexed because you should already have your source product page indexed and aiming to rank well. If not see last paragraph.
I slightly misread your post and took the URLs to be purely filtered. You should add disallow /catalogsearch to your robots.txt and if any are indexed you can remove the directory in Webmaster Tools > Google Index > Remove URLs > Reason: Remove Directory. This from Google - http://www.mattcutts.com/blog/search-results-in-search-results/
If your site has any other parameters not in that directory you can add them in Webmaster Tools > Crawl > URL Parameters > Let Googlebot Decide. Google will understand they are not the main URLs and treat them accordingly.
As a side issue with your search results it would be a good idea to analyse them in Analytics. You might find you have a trend, maybe something searched for or not the perfect match for the returned result, where you can create new more targeted content.
-
I'm not sure this is the right approach. The catalog search is based on the search box on the website. The query parameter can be anything the customer enters. Are you suggesting that the backend code be modified to always return the in every result?
And why that page because that URL just redirects to the home page because there is no query parameter provided for the search.
In terms o losing link equity, how much equity do they have it they are duplicate content?
-
Hi Jeremy.
Yours is a common problem. The best way to deal with it is, as Wesley mentions, by putting canonical tags on all the duplicate pages - the one you want indexed and to show up in search results AND all the others that you can arrive at via catalog search or any other means of navigation.
Michael's suggestion will prevent the duplicate pages from getting indexed by Google. Unfortunately you lose any link equity going that route, so I'd suggest starting with canonical tags first.
-
To back up the detail Wesley gave you, you can also add URL parameters in Google Webmaster Tools
-
You could add a canonical tag to link to the default page. This way Google will know that it should only index that.
The code for this would be:This should be placed in the section of your HTML code.
Some more resources on the subject:
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Duplicate content in Shopify - subsequent pages in collections
Hello everyone! I hope an expert in this community can help me verify the canonical codes I'll add to our store is correct. Currently, in our Shopify store, the subsequent pages in the collections are not indexed by Google, however the canonical URL on these pages aren't pointing to the main collection page (page 1), e.g. The canonical URL of page 2, page 3 etc are used as canonical URLs instead of the first page of the collections. I have the canonical codes attached below, it would be much appreciated if an expert can urgently verify these codes are good to use and will solve the above issues? Thanks so much for your kind help in advance!! -----------------CODES BELOW--------------- <title><br /> {{ page_title }}{% if current_tags %} – tagged "{{ current_tags | join: ', ' }}"{% endif %}{% if current_page != 1 %} – Page {{ current_page }}{% endif %}{% unless page_title contains shop.name %} – {{ shop.name }}{% endunless %}<br /></title>
Intermediate & Advanced SEO | | ycnetpro101
{% if page_description %} {% endif %} {% if current_page != 1 %} {% else %} {% endif %}
{% if template == 'collection' %}{% if collection %}
{% if current_page == 1 %} {% endif %}
{% if template == 'product' %}{% if product %} {% endif %}
{% if template == 'collection' %}{% if collection %} {% endif %}0 -
Do search engine consider this duplicate or thin content?
I operate an eCommerce site selling various equipment. We get product descriptions and various info from the manufacturer's websites offered to the dealers. Part of that info is in the form of User Guides and Operational Manuals downloaded in pdf format written by the manufacturer, then uploaded to our site. Also we embed and link to videos that are hosted on the manufacturer's respective YouTube or Vimeo channels. This is useful content for our customers.
Intermediate & Advanced SEO | | MichaelFactor
My questions are: Does this type of content help our site by offering useful info, or does it hurt our SEO due to it being thin and or duplicate content? Or does the original content publishers get all the benefit? Is there any benefit to us publishing this stuff? What exactly is considered "thin content"?0 -
SEO effect of content duplication across hub of sites
Hello, I have a question about a website I have been asked to work on. It is for a real estate company which is part of a larger company. Along with several other (rival) companies it has a website of property listings which receives a feed of properties from a central hub site - so lots of potential for page, title and meta content duplication (if if isn't already occuring) across the whole network of sites. In early investigation I don't see any of these sites ranking very well at all in Google for expected search phrases. Before I start working on things that might improve their rankings, I wanted to ask some questions from you guys: 1. How would such duplication (if it is occuring) effect the SEO rankings of such sites individually, or the whole network/hub collectively? 2. Is it possible to tell if such a site has been "burnt" for SEO purposes, especially if or from any duplication? 3. If such a site or the network has been totally burnt, are there any approaches or remedies that can be made to improve the site's SEO rankings significantly, or is the only/best option to start again from scratch with a brand new site, ensuring the use of new meta descriptions and unique content? Thanks in advance, Graham
Intermediate & Advanced SEO | | gmwhite9991 -
Duplicated Content with Index.php
Good Afternoon, My website uses Joomla CMS and has the htaccess rewrite code enabled to ensure the use of search engine friendly URLs (SEF's). While browsing the crawl diagnostics I have found that Moz considers the /index.php URL a duplicate to our root. I will always under the impression that the htaccess rewrite took care of that issue and obviously I would like to address it. I attempted to create a 301 redirect from the index.php URL to the root but ran into an issue when attempting to login to the admin portion of the website as the redirect sent me back to the homepage. I was curious if anyone had advice for handling the index.php duplication issue, specifically with Joomla. Additionally, I have confirmed that in Google Webmasters, under URL parameters, the index.php parameter is set as 'Representative URL'.
Intermediate & Advanced SEO | | BrandonEML0 -
Google WMT Showing Duplicate Content, But There is None
In the HTML improvements section of Google Webmaster Tools, it is showing duplicate content and I have verified that the duplicate content they are listing does not exist. I actually have another duplicate content issue I am baffled by, but that it already being discussed on another thread. These are the pages they are saying have duplicate META descriptions, http://www.hanneganremodeling.com/bathroom-remodeling.html (META from bathroom remodeling page) <meta name="<a class="attribute-value">description</a>" content="<a class="attribute-value">Bathroom Remodeling Washington DC, Bathroom Renovation Washington DC, Bath Remodel, Northern Virginia,DC, VA, Washington, Fairfax, Arlington, Virginia</a>" /> http://www.hanneganremodeling.com/estimate-request.html (META From estimate page) <meta name="<a class="attribute-value">description</a>" content="<a class="attribute-value">Free estimates basement remodeling, bathroom remodeling, home additions, renovations estimates, Washington DC area</a>" /> WlO9TLh
Intermediate & Advanced SEO | | WebbyNabler0 -
Proper Hosting Setup to Avoid Subfolders & Duplicate Content
I've noticed with hosting multiple websites on a single account you end up having your main site in the root public_html folder, but when you create subfolders for new website it actually creates a duplicate website: eg. http://kohnmeat.com/ is being hosted on laubeau.com's server. So you end up with a duplicate website: http://laubeau.com/kohn/ Anyone know the best way to prevent this from happening? (i.e. canonical? 301? robots.txt?) Also, maybe a specific 'how-to' if you're feeling generous 🙂
Intermediate & Advanced SEO | | ATMOSMarketing560 -
Http and https duplicate content?
Hello, This is a quick one or two. 🙂 If I have a page accessible on http and https count as duplicate content? What about external links pointing to my website to the http or https page. Regards, Cornel
Intermediate & Advanced SEO | | Cornel_Ilea0 -
Removing Duplicate Page Content
Since joining SEOMOZ four weeks ago I've been busy tweaking our site, a magento eCommerce store, and have successfully removed a significant portion of the errors. Now I need to remove/hide duplicate pages from the search engines and I'm wondering what is the best way to attack this? Can I solve this in one central location, or do I need to do something in the Google & Bing webmaster tools? Here is a list of duplicate content http://www.unitedbmwonline.com/?dir=asc&mode=grid&order=name http://www.unitedbmwonline.com/?dir=asc&mode=list&order=name
Intermediate & Advanced SEO | | SteveMaguire
http://www.unitedbmwonline.com/?dir=asc&order=name http://www.unitedbmwonline.com/?dir=desc&mode=grid&order=name http://www.unitedbmwonline.com/?dir=desc&mode=list&order=name http://www.unitedbmwonline.com/?dir=desc&order=name http://www.unitedbmwonline.com/?mode=grid http://www.unitedbmwonline.com/?mode=list Thanks in advance, Steve0