Removing duplicated content using only NOINDEX at large scale (80% of the website)
-
Hi everyone,
I am taking care of a large news website (500k pages), which took a massive hit from Panda because of duplicated content (70% was syndicated). I recommended that all syndicated content be removed and that the website focus on original, high-quality content.
However, this was implemented only partially. All syndicated content is set to NOINDEX (they think it is good for users to see standard news plus the original HQ content). Of course, it didn't help at all; no change after months. If I were Google, I would definitely penalize a website that has 80% of its content set to NOINDEX because it is duplicated. I would consider this site "cheating" and not worthy for the user.
What do you think about this "theory"? What would you do?
Thank you for your help!
-
-
It has been almost a year now since the massive hit. After that, there were also some smaller hits.
-
We are putting effort into improvements. That is quite frustrating for me, because I believe our effort is being demolished by the old duplicated content (which makes up 80% of the website :-))
Yeah, we will need to take care of the link mess...
Thank you! -
-
Yeah, this strategy will definitely be part of the guidelines for the editors.
One last question: do you know of some good resources I can use as inspiration?
Thank you so much..
-
We deleted thousands of pages every few months.
Before deleting anything we identified valuable pages that continued to receive traffic from other websites or from search. These were often updated and kept on the site. Everything else was 301 redirected to the "news homepage" of the site. This was not a news site, it was a very active news section on an industry portal site.
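As an illustration only (the path pattern below is made up, not EGOL's actual setup), the "301 redirect everything else to the news homepage" step could look like this in an Apache .htaccess:

```apacheconf
# Hypothetical example: permanently redirect retired story URLs
# from an old date-based folder to the news section homepage.
RedirectMatch permanent ^/news/2013/ /news/
```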
You set 410 for those pages, removed all internal links to them, and Google was OK with that?
Our goal was to avoid internal links to pages that were going to be deleted. Our internal "story recommendation" widgets would stop showing links to pages after a certain length of time. Our periodic purges were done after that length of time.
We never used hard coded links in stories to pages that were subject to being abandoned. Instead we simply linked to category pages where something relevant would always be found.
Develop a strategy for internal linking that will reduce site maintenance and focus all internal links to pages that are permanently maintained.
-
Yikes! Will you guys still pay for it if it's removed? If so, then combining the comments below with my thoughts, I'd delete it, since it's old and not time-relevant.
-
Yeah, paying ... we actually pay for this content (earlier management decisions :-))
-
EGOL, your insights are very much appreciated :-)!
I agree with you. Makes total sense.
So you didn't experience any problems removing outdated content (or "content with no traffic value") from your website? You set 410 for those pages, removed all internal links to them, and Google was OK with that?
Redirecting useless content: you mean setting a 301 to the most relevant page that is bringing traffic?
Thank you, sir.
-
But I still don't see the point of paying for content that is not accessible from search engines.
- "paying"?
Is my understanding right that if I set a canonical for these duplicates, Google has no reason to show these pages in the SERPs?
- correct
-
Hi Dimitrii,
Thank you very much for your opinion. The idea of canonical links is very interesting; we may try that in the "first" phase. But I still don't see the point of paying for content that is not accessible from search engines.
Is my understanding right that if I set a canonical for these duplicates, Google has no reason to show these pages in the SERPs?
-
Just seeing the other responses. I agree with what EGOL mentions. A content audit would be even better, to see if there is any value at all on those pages (GA traffic, links, etc.). Odds are, though, that there isn't, and you already killed all of it with the noindex tag in place.
-
A couple of things here.
-
If a second Panda update has not occurred since the changes were made, then you may not get credit for the noindexed content. I don't think this is "cheating": with the noindex, you just told Google to take 350K of your pages out of the index. Noindex is one of the best ways to get your content out of Google's index.
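For anyone following along, the noindex being discussed is a one-line meta robots directive in each page's head:

```html
<!-- Tells search engines to drop this URL from the index.
     Crawlers can still fetch the page and follow its links. -->
<meta name="robots" content="noindex">
```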
-
If you have not spent time improving the non-syndicated content, then you are missing the more important part, which is to improve the quality of the content you have.
A side point to consider here is your crawl budget. I am assuming the site still links internally to these 350K pages, so users and bots will visit and have to process them, etc. This is mostly a waste of time. As all of these pages are out of Google's index thanks to the noindex tag, why not remove all internal links to them (i.e., from sitemaps, paginated index pages, menus, and internal content) so that users and Google can focus on the quality content that is left over? I would then also 404/410 all those low-quality pages, as they are now out of Google's index and not linked internally. Why maintain the content?
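A minimal sketch of the 404/410 step on Apache, assuming (hypothetically) that the retired syndicated stories live under a `/syndicated/` path; substitute your real URL pattern:

```apacheconf
# Hypothetical path: serve "410 Gone" for the removed syndicated section
RedirectMatch gone ^/syndicated/
```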
-
-
Good point! News gotta be new
-
If there are 500,000 pages of "news" then a lot of that content is "history" instead of "news". Visitors are probably not consuming it. People are probably not searching for it. And actively visited pages on the site are probably not linking to it.
So, I would use analytics to determine if these "history" pages are being viewed, are pulling in much traffic, have very many links, and I would delete and redirect them if they are not important to the site any longer. This decision is best made at the page level.
For "unique content" pages that appear only on my site, I would assess them at regular intervals to determine which ones are pulling in traffic and which ones are not. Some sites place news in folders according to their publication dates and that facilitates inspecting old content for its continued value. These pages can then be abandoned and redirected once their content is stale and not being consumed. Again, this can best be done at the page level.
I used to manage a news section and every few months we would assess, delete and redirect, to keep the weight of the site as low as possible for maximum competitiveness.
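As a rough sketch of that page-level "assess, delete and redirect" decision, here is a minimal Python example. The traffic threshold, field names, and data shape are illustrative assumptions, not EGOL's actual criteria:

```python
# Minimal content-audit sketch: decide keep vs. redirect per page.
# Threshold and data shape are illustrative assumptions.

PAGEVIEW_THRESHOLD = 10  # monthly views below this = candidate for removal

def audit(pages):
    """pages: list of dicts with 'url', 'monthly_views', 'external_links'."""
    keep, redirect = [], []
    for page in pages:
        # Keep pages that still earn traffic or have inbound links;
        # everything else gets 301-redirected to a relevant hub page.
        if page["monthly_views"] >= PAGEVIEW_THRESHOLD or page["external_links"] > 0:
            keep.append(page["url"])
        else:
            redirect.append(page["url"])
    return keep, redirect

pages = [
    {"url": "/news/2014/big-story", "monthly_views": 250, "external_links": 3},
    {"url": "/news/2013/stale-item", "monthly_views": 1, "external_links": 0},
]
keep, redirect = audit(pages)
```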
-
Hi there.
NOINDEX !== no crawling, and it certainly doesn't equal NOFOLLOW. What you should probably be looking at is canonical links.
My understanding is (and I could be completely wrong) that when you get hit by Panda for duplicate content and then try to recover, Google checks your website for the same duplicate content: it's still crawlable, all the links are still "followable", it's still scraped content, and you aren't telling crawlers that you took it from somewhere else (by canonicalizing); it's just not displayed in the SERPs. And yes, 80% of the content being noindexed probably doesn't help either.
So I think what you need to do is either remove that duplicate content altogether, use canonical links pointing to the originals, or (a bad idea, but it would work) block all those URLs in robots.txt (at least that way the pages become uncrawlable). All of these are still disreputable techniques, though; kinda like polishing the dirt.
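For reference, the canonical option is a single link element in the head of each syndicated copy, pointing at the original article (the URL here is a placeholder):

```html
<!-- In the <head> of the syndicated copy -->
<link rel="canonical" href="https://original-publisher.example/story">
```

And the robots.txt option is a one-line disallow (the `/syndicated/` path is an assumption about how the duplicates are organized):

```
User-agent: *
Disallow: /syndicated/
```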
Hope this makes sense.