Wow, yes - sorry about that. I've updated it. Google's original write-up actually covers this case, too (it's toward the end):
http://googlewebmastercentral.blogspot.com/2011/09/pagination-with-relnext-and-relprev.html
I tend to agree - you always run the risk with cross-domain canonical that Google might not honor it, and then you've got a major duplicate content problem on your hands.
I think there's a simpler reason, in most cases, though. Three unique sites/brands take 3X (or more, in practice) the time and energy to promote, build links to, build social accounts for, etc. That split effort, especially on the SEO side, can far outweigh the brand benefits, unless you have solid resources to invest (read that "$$$").
To be fair, I don't know your strategy/niche, but I've just found that to be true 95% of the time in these cases. Most of the time, I think building sub-brands on sub-folders within the main site and only having one of each product page is a better bet. The other advantage is that users can see the larger brand (it lends credibility) and can move between brands if one isn't a good match.
The exception would be if there's some clear legal or competitive reason the brands can't be publicly associated. In most cases, though, that's going to come with a lot of headaches.
Just to clarify on a point William raised - do you want the "doorway" page to rank? That's where the trouble usually starts. Vanity URLs and landing pages are fine - if you canonical them, NOINDEX them, or even 301-redirect them (if you don't care whether the URL changes), then Google isn't going to see any attempt at manipulative content.
The other issue is generally relevance. The problem in the past was when people used a doorway page to bring visitors in on one term or set of terms and then just linked them out to a wildly different site. If the landing page is relevant and non-duplicate, it's not nearly as big of a deal.
Personally, I would not do private registration, separate hosting, etc., because that honestly looks like you are trying to hide this page and get away with something. Just don't index the page, and you should be fine. I've used landing pages for paid search on all kinds of URLs, and as long as organic didn't see them, it's no problem at all.
This gets tricky fast. Google currently wants rel=prev/next to contain the parameters currently in use (like sorts) for the page you're on, and then wants you to rel-canonical to the version without those extra parameters. So, if the URL is:
http://www.virtualsheetmusic.com/downloads/Indici/Guitar.html?cp=3&lpg=40
...then the tags should be...
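Assuming "cp" is the page number and "lpg" is just an items-per-page setting (that's my read of those parameters, so adjust if I'm wrong), something roughly like this in the <head> of that page:

<!-- sketch only: assumes cp = page number, lpg = items-per-page display setting -->
<link rel="canonical" href="http://www.virtualsheetmusic.com/downloads/Indici/Guitar.html?cp=3" />
<link rel="prev" href="http://www.virtualsheetmusic.com/downloads/Indici/Guitar.html?cp=2&amp;lpg=40" />
<link rel="next" href="http://www.virtualsheetmusic.com/downloads/Indici/Guitar.html?cp=4&amp;lpg=40" />

In other words, rel=prev/next keep the parameters currently in use, while the canonical drops the display parameter but keeps the page itself - which matches the pattern Google describes.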
Yeah, it's a bit strange. They have suggested that it's ok to rel-canonical to a "View All" page, but with the kind of product volume you have, that's generally a bad idea (for users and search). They have specifically recommended against setting rel-canonical to Page 1 of search results, especially if you use rel=prev/next.
Rel=prev/next will still show pages in the index, but I've found it to work pretty well. The other option is the more classic approach of simply adding META NOINDEX, FOLLOW to pages 2+. That can still be effective, but it's getting less common.
Adam Audette generally has strong posts about this topic - here's a good, recent one:
http://searchengineland.com/the-latest-greatest-on-seo-pagination-114284
My gut reaction is the same as Marie's - it doesn't sound like a typical negative SEO approach. It seems more likely that a past SEO attempt by a 3rd party went farther than you thought or got spread/duplicated somehow. Unfortunately, there's virtually no way to tell who created a link (let alone why).
If many of these are coming from just a few domains, I'd also recommend using the disavow tool. I do agree with James that these links could potentially harm you, even no-followed.
I'd look for any common clues as to who might be leaving these comments, though. I'd just want to make absolutely sure it's not someone operating on your behalf. I've seen that way too many times in the past.
Just to second what Mike said - it's always tough to speak in generalities, but I can't think of any benefit to this approach. Typically, 301s are the preferred method for changing URLs. If you just kill the old pages and introduce new ones with the same content, you not only may experience some short-term duplicate content issues, but you also lose the inbound links and ranking signals pointing to those old URLs.
Are you concerned about transferring a penalty via 301s? I'm just not clear on what the goal is here.
I have to disagree on this one. If Google honors a canonical tag, the non-canonical page will generally disappear from the index, at least inasmuch as we can measure it (with "site:", getting it to rank, etc.). It's a strong signal in many cases.
This is part of the reason Google introduced rel=prev/next for paginated content. With canonical, pages in the series aren't usually able to rank. Rel=prev/next allows them to rank without clogging up the index (theoretically). For search pagination, it's generally a better solution.
If your paginated content is still showing in large quantities in the index, Google may not be honoring the canonical tag properly, and those pages could be causing duplicate content issues. It depends on the implementation, but Google recommends these days that you don't canonical to the first page of search results, and they may choose to ignore the tag in some cases.
I'm not aware of it having any practical value, for SEOs or users. It's hard to even find what the original intent was - most references are just lists of tags that exist. Some 3rd-party applications could use it, but I've never heard of it impacting Google search.
We do handle it a bit differently - we try to flag near duplicates by looking at source code. Glancing at a few of the instances on your site, I think we're getting a bit hung up by all of the code for the menus (like the drop-down options). It's really heavy HTML, so when only a couple of search results are different, it's making the pages seem too similar.
On the one hand, I think Google does know to ignore some aspects, like menus, and the distinct META data does help. On the other hand, search results pages, especially ones with limited or similar results, are considered fairly low value by Google, and you've got a ton of them. By trying to rank all of these variations, you probably are diluting your index quite a bit.
So, I'd say that we're being overzealous here, but I'd also say that it's indicative of a problem to some extent.
It's really tough to tell without specifics, but a couple of suggestions:
(1) If you're getting a mis-match between what you see and what we see (such as a 404 error for a page that seems to load), I'd run a header checker, like:
http://tools.seobook.com/server-header-checker/
Make sure that your pages are being served up correctly (it's hard to see the actual status codes through a browser).
(2) For duplicates, it depends a lot on why they're being generated. Often, it's a URL variation or minor change that's creating dozens or hundreds of pages with the same title. Keep in mind that every unique URL is a "page" to a crawler.
I've got a mega-post on the subject here - it's complicated:
http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world
Without seeing the actual site in question, that's my opinion, yes.
I honestly don't think that's a big deal - as long as you aren't creating tags or adding categories in a way that this could spin out of control. You've basically got 20-ish search result pages. They aren't high value, but they are useful paths to the blog content and they could rank for category keywords. I think it's a balancing act, and in many cases internal search can spin out of control and harm a site. My gut reaction, though, is that you're not in that situation, and cutting off these pages might do more harm than good.
Are these just snippets (link + paragraph) or are you displaying large portions of the posts on the home/category pages?
It depends a bit on the site structure, but I'd actually be wary of setting the category page canonicals back up to the main blog. These aren't really duplicates, and that could send an odd (and potentially negative) signal to Google, especially if there are a lot of them.
If you're talking about a few category pages, leave it alone. Use rel=prev/next for pagination and make sure you're handling any search filters (and not spinning out URLs), but just let these pages get crawled normally. They're an important path on the site.
If you've got a ton of categories, sub-categories, and tags, then I'd go with META NOINDEX. Important note, though: in most cases, you'd use NOINDEX, FOLLOW (not NOFOLLOW) - you don't want to cut the path for crawlers to reach your individual posts. Again, this does depend a bit on the site architecture and whether you have other crawl paths.
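For reference, that tag just goes in the <head> of each page you want out of the index - the "follow" part is what keeps the crawl path to your posts open:

<meta name="robots" content="noindex, follow">

...as opposed to "noindex, nofollow", which would also tell crawlers not to pass through to the posts those pages link to.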
The main issue with too many on-page links is just dilution - there's not a hard limit, but the more links you have, the less value each one has. It's an unavoidable reality of internal site architecture and SEO.
Nofollow has no impact on this problem - link equity is still used up, even if the links are nofollow'ed. Google changed this a couple of years back due to abuse of nofollow for PageRank sculpting.
Unfortunately, I'm having a lot of issues loading your site, even from Google's cache, so I'm not able to see the source code first-hand.
I have to disagree with Bryan, I'm afraid - I think you carry substantial risk here, and this is a tricky decision. While EMD influence is declining, it still can carry a lot of weight (and quite a bit more than sub-domain keywords). If most of your traffic is coming from those "head" terms, you may see a serious loss by moving from EMDs to sub-domains.
Sub-domains have other issues, too, like fragmentation. Since the verticals are very different, Google could treat each sub-domain more like a separate domain. Then, your link equity won't consolidate AND you'll lose the EMD advantage. So, there's actually a risk of a worst-of-both-worlds scenario.
Now, to be fair - consolidation can have benefits, like unifying your link profiles, simplifying your other marketing efforts (one site to promote on social media), etc. Also, since your niches are really just different marketing perspectives on the same product, it's possible that your current sites might look a little thin to Google. In that case, consolidation could help, but "consolidation" would mean thinning out the separate pages, not just moving to one domain with a bunch of sub-domains.
Whether it's better for users really depends on your customer base. Do they tend to look for chat products as a general product, and then decide how it fits their industry, or do they look for products targeted to their industry? If the latter, then the separate domains might actually be more user-friendly.
Sorry, I know this is clear as mud, but I just want you to be aware of the complexity and possible issues. I would not make this decision lightly. Please note, too, that I'm generally in favor of consolidation and am not a big fan of an EMD-based strategy. We have to be realistic about what works now, though, vs. what may work in a couple of years, and I'm just concerned about the short-term impact for you.
My gut reaction, long-term, is that you could build a more product-focused site that has solid landing pages for each vertical, and that each vertical may not need a sub-site. This could create a stronger single site over time. It really depends how much unique content you've got within each vertical, and how your visitors find you. Even if that's a good long-term strategy, it could still have short-term negative impact, so you have to be aware of that and able to weather it.
You could canonical the "/portable-hard-disk" pages back up to "/hard-disk", but honestly, unless this is a widespread problem, I'd probably ignore it. If you have a lot of these sub-categories with duplicate search results, then I'd consider changing up your canonical scheme or NOINDEX'ing some sub-categories - search results just aren't high-value to Google, especially if they start all looking the same.
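If you did go the canonical route, it's just a tag in the <head> of the sub-category pages pointing up to the parent (example.com is a placeholder, since I don't know your full URLs):

<!-- placed on the /hard-disk/portable-hard-disk/ pages; example.com is a placeholder domain -->
<link rel="canonical" href="http://www.example.com/hard-disk/" />

Again, though, I'd only bother if the overlap turns out to be widespread.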
If this is an isolated occurrence, though, it's a lot of trouble for a relatively minor problem. It would take a pretty deep knowledge of your product inventory and site structure to know for sure, but my gut reaction is that this is a small issue.
I talked to the technical team. The screen may be a bit confusing. Your "items_per_page" variations are not being flagged as a duplicate of "/hard-disk/portable-hard-disk/". All of the pages (including the items_per_page variants) are being flagged as near-duplicates (95%+) of "/hard-disk". Basically, since those pages show the exact same products and only differ by a header, we're flagging them as being too similar. Once we do that, then all of the other pages that canonical to the "/portable-hard-disk" page also look like near-duplicates of "/hard-disk".
It's not catastrophic, but if you have enough of these category/sub-category search pages that overlap on their results, you may want to reconsider whether you index all of them. At small scale, it's not a big deal. At large scale, these very similar pages could dilute your ranking ability.
Just to add to the consensus (although credit goes to multiple people on the thread) - PR-sculpting with nofollow on internal links no longer works, and it can be counter-productive. If these links are needed for users, don't worry about them, and don't disrupt PR flow through your site. Ultimately, you're only talking about a few pages, and @sprynewmedia is right - Google probably discounts footer links even internally (although we may have no good way to measure this).
Be careful with links like "register", though, because sometimes they spin off URL variations, and you don't want those all indexed. In that case, you'd probably want to NOINDEX the target page - it just doesn't have any search value. I'm not seeing that link in your footer, though, so I'm not clear on what it does. I see this a lot with "login" links.
100% agreed - 403 isn't really an appropriate alternative to 404. I know SEOs who claim that 410s are stronger/faster, but I haven't seen great evidence in the past couple of years. It's harmless to try 410s, but I wouldn't expect miracles.
Yeah, no argument there. I worry about it from an SEO standpoint, but sometimes there really isn't a lot you can do, from a business standpoint. I think it's occasionally worth a little fight, though - sometimes, when all the dealers want to have their cake and eat it, too, they all suffer (at least, post-Panda). Admittedly, that's a long, difficult argument, and you have to decide if it's worth the price.
Let me jump in and clarify one small detail. If you delete a page, which would naturally result in a 404, but then 301-redirect that page/URL, there is no 404. I understand the confusion, but ultimately you can only have one HTTP status code. So, if the page properly 301s, it will never return a 404, even if it's technically deleted.
If the page 301s to a page that looks like a "not found" sort of page (content-wise), Google could consider that a "soft 404". Typically, though, once the 301 is in place, the 404 is moot.
For any change in status, the removal of crawl paths could slow Google re-processing those pages. Even if you delete a page, Google has to re-crawl it to see the 404. Now, if it's a high-authority page or has inbound (external) links, it could get re-crawled even if you cut the internal links. If it's a deep, low-value page, though, it may take Google a long time to get back and see those new signals. So, sometimes we recommend keeping the paths open.
There are other ways to kick Google to re-crawl, such as having an XML sitemap open with those pages in them (but removing the internal links). These signals aren't as powerful, but they can help the process along.
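If it helps, a bare-bones sitemap for that purpose is just a list of the URLs you want revisited - something like this (the URLs below are placeholders, obviously):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- placeholder URLs - list the pages you want Google to re-crawl -->
  <url><loc>http://www.example.com/old-page-1/</loc></url>
  <url><loc>http://www.example.com/old-page-2/</loc></url>
</urlset>

You can submit that file in Google Webmaster Tools or reference it in your robots.txt.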
As to your specific questions:
(1) It's very tricky, in practice, especially at large-scale. I think step 1 is to dig into your index/cache (slice and dice with the site: operator) and see if Google has removed these pages. There are cases where massive 301s, etc. can look fishy to Google, but usually, once a page is gone, it's gone. If Google has redirected/removed these pages, and you're still penalized, then you may be fixing the wrong problem or possibly haven't gone far enough.
(2) It really depends on the issue. If you cut too deep and somehow cut off crawl paths or stranded inbound links, then you may need to re-establish some links/pages. If you 301'ed a lot of low-value content (and possibly bad links), you may actually need to cut some of those 301s and let those pages die off. I agree with @mememax that sometimes a healthy combination of 301s/404s is a better bet - pages go away, and 404s are normal if there's really no good alternative to the page that's gone.
We don't currently have a way to ignore warnings/errors, although I know that's on the wish list. Let me ping the Product Team on this one and see if they have any additional insight.
I have to disagree with Mike a bit - this is the kind of situation that can cause problems, and I think the duplication across the industry actually makes it even more likely. Yes, the big players can get away with it, and Google understands the dynamic to some degree, but if you have a new site or smaller brand, you could greatly weaken your ranking ability. You especially have to be careful out of the gate, IMO, when your authority is weak.
To be fair, I'm assuming you're a small to mid-sized player and not a major brand, so if that's an incorrect assumption, let me know.
There aren't many have-your-cake-and-eat-it-too approaches to duplicate content in 2013. If you use rel=canonical, NOINDEX, etc., then some version of the page won't be eligible for ranking. If you don't, then the pages could dilute each other or even harm the ranking of the overall site. Each product won't "carry the same weight in search" - if you don't pick, Google will, and your internal site architecture and inbound link structure are always going to weight some pages more highly than others. Personally, I think it's better to choose than to have the choice made for you (which is usually what happens).
I'd also wonder if this structure is really that great for users - people don't want to happen across nine versions of the same page that only differ by the branch. The branch is your construct, not theirs, and it's important to view this from the visitor perspective.
Unfortunately, I don't understand the business/site well enough to give you a great alternative. Is there a way to create a unified product URL/page, but still give the branch credit when a visitor hits the product via their sub-site? For example, you could cookie the visitor and then show the branch's template (logo, info, etc.) at the top of the page, but still keep one default URL that Google would see. As long as new visitors to the site also see that default, it's not a problem.
As best I can tell, your canonical tags are properly implemented and Google doesn't seem to be indexing any URLs with "items_per_page" in them. Our crawler and desktop crawlers may be getting confused because there are internal paths to these variations.
Ideally, that pulldown probably shouldn't be crawlable, but I think your canonical implementation as it stands is ok. I don't see any evidence that Google is having problems with it. It may just be a false alarm on our part.
Did Google process the 301s? In other words, are the old pages still in the index or not? If they processed the 301s eventually, you generally should be ok. If the old URLs seem stranded, then you might be best setting up the XML sitemap with those old URLs to just kick Google a little. I don't think I'd switch signals and move from a 301 to 404, unless the old pages are low quality, had bad links, etc.
Unfortunately, these things are very situational, so it can be hard to speak in generalities.
If I'm (we're) understanding your situation correctly, then I'd have to agree with Mike. You should 301-redirect all of the versions, not "chain" the canonical to a 301. That's going to produce very unpredictable results at best.
I'd generally say "no", but I suspect that there may be a point where the ratio of noindex'ed to indexed pages starts making Google wonder about your content quality. I think it also depends on where those pages fall in the site architecture. If you noindex everything below a certain point, that may be completely reasonable (I've done it without incident - ex. shopping cart pages, etc.). If you start noindex'ing major pages or it looks suspicious (like you're trying to PR-sculpt), I could see that raising red flags.
I doubt it's an issue, but I just think it's possible to take anything too far, especially if the usage seems deceptive or spammy in some way. Without knowing the situation, though, my gut reaction is that it's not a problem in 90% of cases.
I'd tend to agree with Nakul on proceeding with caution - while Google doesn't necessarily treat "_" as a word separator, the URL is just one relatively small ranking factor. There are many risks in a site-wide 301-redirect, especially when you're redesigning. If the redesign runs into SEO trouble, you're not going to be able to separate the many changes, and that could delay fixing any problems.
The exception would be if you're planning to change a lot of the URLs anyway, as part of the redesign. Then, I'd go ahead and do it all at once. Hyphens are a nice-to-have - I'm just not sure that, practically, the risks outweigh the rewards. It does depend a lot on how you're currently ranking and whether the URLs are causing you any major headaches.
When you say "disavowed", you mean that you've specifically used the disavow tool Google provides? In that case, you shouldn't need to have them re-crawled - my best guess from talking to other SEOs is that the disavow file basically acts as a layer of data on top of the link graph.
Now, if you've had links removed, and you want Google to acknowledge the removal, then yes, you'll need to get the pages re-crawled - or else just let time kill them off (but that could take a while). This is tricky, though - you don't necessarily want to promote a spammy page or drive more authority to it.
If it was one site you controlled, you could use XML sitemaps, the Webmaster Tools URL submission form, or a service like Ping-O-Matic (http://pingomatic.com/) to nudge Google to re-crawl, but most of those solutions don't work for a bunch of URLs from other people's sites.
So, you're left building links or somehow drawing attention to them, which can be dangerous. You can promote them in social, too, but again, then you're basically vouching for those pages, and that's not exactly going to build your social accounts.
If you're using the GWT disavow, I'd just give it time. Otherwise, I'd probably try something like pinging (you'll have to hack together a list of the URLs somehow, and maybe publish them to an RSS feed) - I think that's the lowest-risk alternative.
Sorry - I got an email that you left a response, but then it was gone(?) Might be a system glitch. Could you reply again?