Large site with faceted navigation using rel=canonical, but Google still has issues
-
First off, I just wanted to mention that I posted this on one other forum as well, so I hope that is not against the rules here or anything. Just trying to get ideas from some of the pros at both sources. Hope this is received well. Now for the question...
"Googlebot found an extremely high number of URLs on your site:"
Gotta love these messages in GWT. Anyway, I wanted to get some other opinions here so if anyone has experienced something similar or has any recommendations I would love to hear them.
First off, the site is very large and utilizes faceted navigation to help visitors sift through results. For many months now I have had rel=canonical implemented so that each URL created by the faceted nav filters points back to the main category page. However, I still get these damn messages from Google every month or so saying that they found too many pages on the site. My main concern, obviously, is wasting crawler time on all these pages when I am doing exactly what they ask in these situations: telling them to ignore the filtered URLs and find the content on page X.
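For context, here's roughly what the implementation looks like (the URLs below are made up for illustration, not our actual structure):

```html
<!-- On a filtered URL such as www.example.com/categoryname/?color=red
     (hypothetical), the canonical points back to the main category page -->
<head>
  <link rel="canonical" href="http://www.example.com/categoryname/" />
</head>
```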
So at this point I am thinking about possibly using a robots.txt file to handle these, but wanted to see what others around here thought before I dive into this arduous task. Plus I am a little ticked off that Google is not following a standard they helped bring to the table.
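If I did go the robots.txt route, I'd be looking at something along these lines (hypothetical filter parameters, just to sketch the idea; Google does support the * wildcard in Disallow rules):

```
# Hypothetical rules blocking crawl of faceted filter URLs
User-agent: *
Disallow: /*?color=
Disallow: /*?size=
```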
Thanks in advance to those who take the time to respond.
-
Yes, that's a different situation. You're now talking about pagination, where, quite rightly, a canonical to the parent page should not be used.
For faceted/filtered navigation it seems like canonical usage is indeed the right way to go about it, given Peter's experience just mentioned above, and the article you linked to that says, "...(in part because Google only indexes the content on the canonical page, so any content from the rest of the pages in the series would be ignored)."
-
As for my situation, it worked out quite nicely; I just wasn't patient enough. After about 2 months the issue corrected itself for the most part, and I was able to remove about a million "waste" pages from the index. This is a very large site, so losing a million pages in a handful of categories helped me gain in a whole lot of other areas and spread the crawler around to more places that were important for us.
I also spent some time doing some restructuring of internal linking from some of our more authoritative pages that I believe also assisted with this, but in my case rel="canonical" worked out pretty nicely. Just took some time and patience.
-
I should actually add that Google doesn't condone using rel-canonical back to the main search page or page 1. They allow canonical to a "View All" or a complex mix of rel-canonical and rel=prev/next. If you use rel-canonical on too many non-identical pages, they could ignore it (although I don't often find that to be true).
Vanessa Fox just did a write-up on Google's approach:
http://searchengineland.com/implementing-pagination-attributes-correctly-for-google-114970
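For reference, the rel=prev/next pattern described there looks roughly like this on a middle page of a series (hypothetical URLs):

```html
<!-- In the <head> of page 2 of a paginated category (hypothetical) -->
<link rel="prev" href="http://www.example.com/categoryname/?page=1" />
<link rel="next" href="http://www.example.com/categoryname/?page=3" />
```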
I have to be honest, though - I'm not a fan of Google's approach. It's incredibly complicated, easy to screw up, doesn't seem to work in all cases, and doesn't work on Bing. This is a very complex issue and really depends on the site in question. Adam Audette did a good write-up:
http://searchengineland.com/five-step-strategy-for-solving-seo-pagination-problems-95494
-
Thanks Dr Pete,
Yes I've used meta no-index on pages that are simply not useful in any way shape or form for Google to find.
I would be hesitant to noindex the filters in question, but it sounds promising that you are backing the canonical approach and that there is latency in the reporting. Our PA and DA are extremely high and we get crawled daily, so I'm curious to try your measurement tip (inurl), which is a good one!
Many thanks.
Simon
-
I'm working on a couple of cases now, and it is extremely tricky. Google often doesn't re-crawl/re-cache deeper pages for weeks or months, so getting the canonical to work can be a long process. Still, it is generally a very effective tag, and sometimes it does kick in quickly.
I agree with others that Robots.txt isn't a good bet. It also tends to work badly with pages that are already indexed. It's good for keeping things out of the index (especially whole folders, for example), but once 1000s of pages are indexed, Robots.txt often won't clean them up.
Another option is META NOINDEX, but it depends on the nature of the facets.
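If the facets do warrant it, the tag itself is simple; a minimal sketch (note the "follow" directive, so links on the page can still be crawled and pass equity):

```html
<!-- In the <head> of each faceted URL you want dropped from the index -->
<meta name="robots" content="noindex, follow">
```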
A couple of things to check:
(1) Using site: combined with inurl:, monitor the faceted navigation pages in the Google index (see the example queries after this list). Are the numbers gradually dropping? That's what you want to see; the GWT error may not update very often. Keep in mind that these numbers can be unreliable, so monitor them daily over a few weeks.
(2) Are there other URLs you're missing? On a large e-commerce site, it's entirely possible this wasn't the only problem.
(3) Did you cut the crawl paths? A common problem is that people canonical, 301-redirect, or NOINDEX, but then nofollow or otherwise cut links to those duplicates. Sounds like a good idea, except that the canonical tag has to be crawled to work. I see this a lot, actually.
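To illustrate point (1), the monitoring queries would look something like this (hypothetical domain and filter patterns):

```
site:example.com inurl:/categoryname/
site:example.com inurl:color=
```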
-
Did you find a solution for this? I have exactly the same issue and have implemented the rel canonical in exactly the same way.
The issue you are trying to address is improving crawl bandwidth/equity by not letting Google crawl these faceted pages.
I am thinking of Ajax-loading these filtered results into the parent category page and/or adding nofollow to the links. But the pages have already been indexed, so I wonder if nofollow will have any effect.
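For what it's worth, the nofollow idea would just be a hint on the facet links themselves; a sketch with hypothetical markup:

```html
<!-- Hypothetical facet link on the category page. nofollow hints that
     crawlers shouldn't follow it, but it won't remove already-indexed URLs -->
<a href="/categoryname/?color=red" rel="nofollow">Red</a>
```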
Have you had any progress? Any further ideas?
-
Because rel canonical does nothing more than give credit to the chosen page and avoid duplicate content. It does not tell the search engine to stop indexing or to redirect, and it has no effect on whether the links are found.
-
thx
-
OK, sorry I was thinking too many pages, not links.
Using no-index will not stop PR flowing; the search engine will still follow the links.
-
Yeah, that is why I am not really excited about using robots.txt or even a noindex in this instance. They are not session IDs, but more like:
www.example.com/categoryname/a
www.example.com/categoryname/b
www.example.com/categoryname/c
etc.
which would show all products that start with those letters. There are a lot of other filters too, such as color, size, etc., but the bottom line is I point all of those back to just www.example.com/categoryname using rel=canonical, and I don't understand why it isn't working properly.
-
There are a large number of URLs like this because of the way the faceted navigation works. I have considered noindex, but I'm somewhat concerned because we do get links to some of these URLs and would like to maintain some of that link juice. The warning shows up in Google Webmaster Tools when Googlebot finds a large number of URLs. The rest of the message reads like this:
"Googlebot encountered extremely large numbers of links on your site. This may indicate a problem with your site's URL structure. Googlebot may unnecessarily be crawling a large number of distinct URLs that point to identical or similar content, or crawling parts of your site that are not intended to be crawled by Googlebot. As a result Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all of the content on your site."
rel=canonical should fix this, but apparently it is not.
-
Check how these pages are being generated.
Robots.txt is not an ideal solution; if Google finds links to these pages elsewhere, the URLs can still end up indexed.
Normally print pages won't have link value, so you may noindex them.
If there are pages with session IDs or campaign codes, use canonical if they have link value. Otherwise noindex will be good.
-
The rel canonical will stop you getting duplicate content flags, but there is still a large number of pages; it's not going to hide them.
I have never seen this warning. How many pages are we talking about? Either the number is very, very high, or the pages are confusing the crawler. You may need to noindex them.