Large site with faceted navigation using rel=canonical, but Google still has issues
-
First off, I just wanted to mention that I posted this on one other forum as well, so I hope that's not against the rules here. I'm just trying to get ideas from the pros at both sources. Hope this is received well. Now for the question...
"Googlebot found an extremely high number of URLs on your site:"
Gotta love these messages in GWT. Anyway, I wanted to get some other opinions here so if anyone has experienced something similar or has any recommendations I would love to hear them.
First off, the site is very large and utilizes faceted navigation to help visitors sift through results. For many months now I have had rel=canonical implemented so that each page URL created by the faceted-nav filters points back to the main category page. However, I still get these damn messages from Google every month or so saying that they found too many pages on the site. My main concern, obviously, is wasting crawler time on all these pages when I am doing exactly what they ask in these instances: telling them to ignore the filtered URLs and find the content on page X.
So at this point I am thinking about possibly using the robots.txt file to handle these, but I wanted to see what others around here thought before I dive into this arduous task. Plus, I am a little ticked off that Google is not following a standard they helped bring to the table.
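If I did go the robots.txt route, I'm picturing something like the sketch below. The paths and parameter names here are hypothetical stand-ins for our real filter URLs:

```
User-agent: *
# Block hypothetical attribute-filter URLs (adjust patterns to the real structure)
Disallow: /*?color=
Disallow: /*?size=
```

Though as I understand it, this only stops crawling; it wouldn't remove URLs that are already in the index.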
Thanks in advance to those who take the time to respond.
-
Yes, that's a different situation. You're now talking about pagination, where, quite rightly, a canonical to the parent page should not be used.
For faceted/filtered navigation it seems like canonical usage is indeed the right way to go about it, given Peter's experience just mentioned above, and the article you linked to that says, "...(in part because Google only indexes the content on the canonical page, so any content from the rest of the pages in the series would be ignored)."
-
As for my situation, it worked out quite nicely; I just wasn't patient enough. After about 2 months the issue corrected itself for the most part, and I was able to remove about a million "waste" pages from the index. This is a very large site, so losing a million pages in a handful of categories helped me gain in a whole lot of other areas and spread the crawler around to more places that were important for us.
I also spent some time doing some restructuring of internal linking from some of our more authoritative pages that I believe also assisted with this, but in my case rel="canonical" worked out pretty nicely. Just took some time and patience.
-
I should actually add that Google doesn't endorse using rel=canonical back to the main search page or page 1. They allow a canonical to a "View All" page or a complex mix of rel=canonical and rel=prev/next. If you use rel=canonical on too many non-identical pages, they could ignore it (although I don't often find that to be true).
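For reference, the rel=prev/next piece of that markup goes in the head of each page in the series. On a middle page it looks roughly like this (URLs here are just illustrative):

```html
<!-- On page 2 of a hypothetical paginated series -->
<link rel="prev" href="http://www.example.com/category?page=1">
<link rel="next" href="http://www.example.com/category?page=3">
```

The first page carries only rel=next and the last page only rel=prev.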
Vanessa Fox just did a write-up on Google's approach:
http://searchengineland.com/implementing-pagination-attributes-correctly-for-google-114970
I have to be honest, though - I'm not a fan of Google's approach. It's incredibly complicated, easy to screw up, doesn't seem to work in all cases, and doesn't work on Bing. This is a very complex issue and really depends on the site in question. Adam Audette did a good write-up:
http://searchengineland.com/five-step-strategy-for-solving-seo-pagination-problems-95494
-
Thanks Dr Pete,
Yes, I've used meta noindex on pages that are simply not useful in any way, shape, or form for Google to find.
I would be hesitant to noindex the filters in question, but it sounds promising that you are backing the canonical approach and that there is latency in the reporting. Our PA and DA are extremely high and we get crawled daily, so I'm curious about your measurement tip (inurl), which is a good one!
Many thanks.
Simon
-
I'm working on a couple of cases now, and it is extremely tricky. Google often doesn't re-crawl/re-cache deeper pages for weeks or months, so getting the canonical to work can be a long process. Still, it is generally a very effective tag, and in some cases it takes effect quickly.
I agree with others that Robots.txt isn't a good bet. It also tends to work badly with pages that are already indexed. It's good for keeping things out of the index (especially whole folders, for example), but once 1000s of pages are indexed, Robots.txt often won't clean them up.
Another option is META NOINDEX, but it depends on the nature of the facets.
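That tag, if the facets warrant it, is a single line in the head of each faceted page; keeping "follow" means links on the page can still be crawled:

```html
<meta name="robots" content="noindex, follow">
```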
A couple of things to check:
(1) Using site: with inurl:, monitor the faceted navigation pages in the Google index. Are the numbers gradually dropping? That's what you want to see - the GWT error may not update very often. Keep in mind that these numbers can be unreliable, so monitor them daily over a few weeks.
(2) Are there other URLs you're missing? On a large e-commerce site, it's entirely possible this wasn't the only problem.
(3) Did you cut the crawl paths? A common problem is that people add a canonical, 301-redirect, or NOINDEX, but then nofollow or otherwise cut the links to those duplicates. It sounds like a good idea, except that the canonical tag has to be crawled to work. I see this a lot, actually.
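As a quick way to sanity-check point (3), something like this rough stdlib-only sketch can confirm that a filtered URL still exposes a crawlable canonical (the URL in the usage comment is a placeholder, not your real structure):

```python
# Rough check: extract the rel=canonical target from a page's HTML,
# so you can confirm faceted URLs still expose a crawlable canonical tag.
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel", "").lower() == "canonical" and self.canonical is None:
            self.canonical = a.get("href")

def find_canonical(html_text):
    parser = CanonicalParser()
    parser.feed(html_text)
    return parser.canonical

# Usage against a live page (placeholder URL -- swap in a real faceted page):
#   from urllib.request import urlopen
#   html = urlopen("http://www.example.com/categoryname/a").read().decode("utf-8", "replace")
#   print(find_canonical(html))
```

If this returns None for pages you thought were canonicalized, the tag may have been stripped or never rendered.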
-
Did you find a solution for this? I have exactly the same issue and have implemented the rel canonical in exactly the same way.
The issue you are trying to address is improving crawl bandwidth/equity by not letting Google crawl these faceted pages.
I am thinking of Ajax-loading these facets into the parent category page and/or adding nofollow to the links. But the pages have already been indexed, so I wonder whether nofollow will have any effect.
Have you had any progress? Any further ideas?
-
Because rel=canonical does nothing more than give credit to the chosen page and avoid duplicate content. It does not tell the search engine to stop indexing or redirect, and as far as finding the links goes, it has no effect.
-
thx
-
OK, sorry I was thinking too many pages, not links.
Using noindex will not stop PageRank from flowing; the search engine will still follow the links.
-
Yeah, that is why I am not really excited about using robots.txt or even a noindex in this instance. They are not session IDs, but more like:
www.example.com/categoryname/a
www.example.com/categoryname/b
www.example.com/categoryname/c
etc.
which would show all products that start with those letters. There are a lot of other filters too, such as color, size, etc., but the bottom line is I point all of those back to just www.example.com/categoryname using rel=canonical, and I am not understanding why it isn't working properly.
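In other words, the head of each filtered URL carries roughly this (the exact markup on our pages, give or take the protocol):

```html
<!-- in the <head> of www.example.com/categoryname/a -->
<link rel="canonical" href="http://www.example.com/categoryname">
```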
-
There are a large number of URLs like this because of the way the faceted navigation works. I have considered noindex, but I'm somewhat concerned because we do get links to some of these URLs and would like to maintain some of that link juice. The warning shows up in Google Webmaster Tools when Googlebot finds a large number of URLs. The rest of the message reads like this:
"Googlebot encountered extremely large numbers of links on your site. This may indicate a problem with your site's URL structure. Googlebot may unnecessarily be crawling a large number of distinct URLs that point to identical or similar content, or crawling parts of your site that are not intended to be crawled by Googlebot. As a result Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all of the content on your site."
rel=canonical should fix this, but apparently it is not.
-
Check how these pages are being generated.
Robots.txt is not an ideal solution; if Google finds links to these pages elsewhere, they will still be crawled.
Normally, print pages won't have link value, so you can noindex them.
If there are pages with session IDs or campaign codes, use a canonical if they have link value; otherwise, noindex is a good choice.
-
The rel=canonical will stop you getting duplicate-content flags, but there is still a large number of pages; it's not going to hide them.
I have never seen this warning. How many pages are we talking about? Either the number is very, very high, or they are confusing the crawler. You may need to noindex them.