Large site with faceted navigation using rel=canonical, but Google still has issues
-
First off, I just wanted to mention I did post this on one other forum so I hope that is not completely against the rules here or anything. Just trying to get an idea from some of the pros at both sources. Hope this is received well. Now for the question.....
"Googlebot found an extremely high number of URLs on your site:"
Gotta love these messages in GWT. Anyway, I wanted to get some other opinions here so if anyone has experienced something similar or has any recommendations I would love to hear them.
First off, the site is very large and utilizes faceted navigation to help visitors sift through results. For many months now I have had rel=canonical implemented so that each page URL created by the faceted nav filters points back to the main category page. However, I still get these damn messages from Google every month or so saying that they found too many pages on the site. My main concern, obviously, is wasting crawl time on all these pages. I am trying to do exactly what Google asks in these situations: tell them to ignore the duplicates and find the content on page X.
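For reference, each filtered URL carries a tag along these lines in its <head> (with example.com standing in for our real domain):
<link rel="canonical" href="http://www.example.com/categoryname" />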
So at this point I am thinking about possibly using the robots.txt file to handle these, but I wanted to see what others around here thought before I dive into this arduous task. Plus I am a little ticked off that Google is not following a standard they helped bring to the table.
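If I did go the robots.txt route, it would be a rough sketch along these lines (hypothetical paths, assuming the filter URLs sit under each category like the examples further down the thread):
User-agent: *
Disallow: /categoryname/a
Disallow: /categoryname/b
Disallow: /categoryname/c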
Thanks for those who take the time to respond in advance.
-
Yes, that's a different situation. You're now talking about pagination, where, quite rightly, a canonical to the parent page should not be used.
For faceted/filtered navigation it seems like canonical usage is indeed the right way to go about it, given Peter's experience just mentioned above, and the article you linked to that says, "...(in part because Google only indexes the content on the canonical page, so any content from the rest of the pages in the series would be ignored)."
-
As for my situation, it worked out quite nicely; I just wasn't patient enough. After about 2 months the issue corrected itself for the most part, and I was able to clear about a million "waste" pages out of the index. This is a very large site, so losing a million pages in a handful of categories helped me gain in a whole lot of other areas and spread the crawler around to more places that were important for us.
I also spent some time doing some restructuring of internal linking from some of our more authoritative pages that I believe also assisted with this, but in my case rel="canonical" worked out pretty nicely. Just took some time and patience.
-
I should actually add that Google doesn't endorse using rel=canonical back to the main search page or page 1. They do allow a canonical to a "View All" page, or a fairly complex mix of rel=canonical and rel=prev/next. If you use rel=canonical on too many non-identical pages, they could ignore it (although I don't often find that to be true).
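As a rough sketch (hypothetical URLs), the rel=prev/next version puts tags like these in the <head> of, say, page 2 of a series:
<link rel="prev" href="http://www.example.com/category?page=1" />
<link rel="next" href="http://www.example.com/category?page=3" />
The "View All" version instead has each paginated page point to the full listing:
<link rel="canonical" href="http://www.example.com/category/view-all" />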
Vanessa Fox just did a write-up on Google's approach:
http://searchengineland.com/implementing-pagination-attributes-correctly-for-google-114970
I have to be honest, though - I'm not a fan of Google's approach. It's incredibly complicated, easy to screw up, doesn't seem to work in all cases, and doesn't work on Bing. This is a very complex issue and really depends on the site in question. Adam Audette did a good write-up:
http://searchengineland.com/five-step-strategy-for-solving-seo-pagination-problems-95494
-
Thanks, Dr. Pete,
Yes, I've used meta noindex on pages that are simply not useful in any way, shape, or form for Google to find.
I would be hesitant to noindex the filters in question, but it sounds promising that you're backing the canonical approach and that there's latency in the reporting. Our PA and DA are extremely high and we get crawled daily, so I'm curious about your measurement tip (inurl:), which is a good one!
Many thanks.
Simon
-
I'm working on a couple of cases now, and it is extremely tricky. Google often doesn't re-crawl/re-cache deeper pages for weeks or months, so getting the canonical to take effect can be a long process. Still, it is generally a very effective tag, and sometimes the cleanup happens quickly.
I agree with others that Robots.txt isn't a good bet. It also tends to work badly with pages that are already indexed. It's good for keeping things out of the index (especially whole folders, for example), but once 1000s of pages are indexed, Robots.txt often won't clean them up.
Another option is META NOINDEX, but it depends on the nature of the facets.
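For reference, that's the standard robots meta tag in the <head> of each faceted page; using "follow" alongside "noindex" (a deliberate choice in this sketch) drops the page from the index while still letting link equity flow through its links:
<meta name="robots" content="noindex, follow">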
A couple of things to check:
(1) Using site: with inurl:, monitor the faceted navigation pages in the Google index (example queries below). Are the numbers gradually dropping? That's what you want to see; the GWT error may not update very often. Keep in mind that these numbers can be unreliable, so monitor them daily over a few weeks.
(2) Are there other URLs you're missing? On a large e-commerce site, it's entirely possible this wasn't the only problem.
(3) Did you cut the crawl paths? A common problem is that people add a canonical, 301-redirect, or NOINDEX, but then nofollow or otherwise cut the links to those duplicates. It sounds like a good idea, except that the canonical tag has to be crawled to work. I see this a lot, actually.
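For point (1), the queries look something like this (hypothetical domain and facet path):
site:example.com inurl:/categoryname/
or, for parameter-based facets:
site:example.com inurl:color=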
-
Did you find a solution for this? I have exactly the same issue and have implemented the rel canonical in exactly the same way.
The issue you are trying to address is improving crawl bandwidth/equity by not letting Google crawl these faceted pages.
I am thinking of Ajax-loading these pages into the parent category page and/or adding nofollow to the links. But the pages have already been indexed, so I wonder if nofollow will have any effect.
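For the nofollow option, the markup on each facet link would be something like this (hypothetical link):
<a href="/categoryname/a" rel="nofollow">A</a>
Though as I understand it, nofollow only affects discovery through that particular link; it won't pull pages that are already indexed back out.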
Have you had any progress? Any further ideas?
-
Because rel=canonical does nothing more than assign credit to the chosen page and avoid duplicate content. It does not tell the search engine to stop indexing or to redirect, and it has no effect on whether the links are found.
-
thx
-
OK, sorry, I was thinking too many pages, not links.
Using noindex will not stop PR from flowing; the search engine will still follow the links.
-
Yeah, that is why I am not really excited about using robots.txt or even a noindex in this instance. They are not session IDs, but more like:
www.example.com/categoryname/a
www.example.com/categoryname/b
www.example.com/categoryname/c
etc.
which would show all products that start with those letters. There are a lot of other filters too, such as color, size, etc., but the bottom line is I point all of those back to just www.example.com/categoryname using rel=canonical, and I don't understand why it isn't working properly.
-
There are a large number of URLs like this because of the way the faceted navigation works. I have considered noindex, but I'm somewhat concerned because we do get links to some of these URLs and would like to maintain some of that link juice. The warning shows up in Google Webmaster Tools when Googlebot finds a large number of URLs. The rest of the message reads like this:
"Googlebot encountered extremely large numbers of links on your site. This may indicate a problem with your site's URL structure. Googlebot may unnecessarily be crawling a large number of distinct URLs that point to identical or similar content, or crawling parts of your site that are not intended to be crawled by Googlebot. As a result Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all of the content on your site."
rel=canonical should fix this, but apparently it is not.
-
Check how these pages are being generated.
Robots.txt is not an ideal solution; if Google finds links to these pages in other places, the URLs can still end up in the index.
Normally, print pages won't have link value, so you may noindex them.
If there are pages with session IDs or campaign codes, use a canonical if they have link value; otherwise, noindex will be good.
-
The rel=canonical will stop you from getting duplicate content flags, but there is still a large number of pages; it's not going to hide them.
I have never seen this warning. How many pages are we talking about? Either the number is very, very high, or they are confusing the crawler. You may need to noindex them.