Excel tips or tricks for duplicate content madness?
-
Dearest SEO Friends,
I'm working on a site that has over 2,400 instances of duplicate content (yikes!).
I'm hoping somebody could offer some excel tips or tricks to managing my SEOMoz crawl diagnostics summary data file in a meaningful way, because right now this spreadsheet is not really helpful. Here's a hypothetical situation to describe why:
Say we had three columns of duplicate content. The data is displayed thusly:
Column A | Column B | Column C
URL A    | URL B    | URL C
In a perfect world, this is easy to understand. I want URL A to be the canonical. But unfortunately, the way my spreadsheet is populated, this ends up happening:
Column A | Column B | Column C
URL A    | URL B    | URL C
URL B    | URL A    | URL C
URL C    | URL A    | URL B
Essentially, all of these URLs would end up being declared canonical, which defeats the purpose of the tag. On a site with only a few errors this has never been a problem, because I can just spot-check my work. But the site I'm working on has thousands of instances, making it really hard to identify these patterns accurately, let alone at scale.
This is particularly problematic because some of these URLs are identified as duplicates 50+ times, so my spreadsheet has well over 100K cells! Madness! Obviously, I can't go through it manually. It would take me years to ensure accuracy, and that's not exactly a scalable approach.
Here's what I would love, but I'm not getting my hopes up. Does anyone know a formulaic way for Excel to identify row matches and think, "Oh! These are all the same rows of data, just in a different order. I'll kill off the duplicate rows, so only one truly unique row exists for this set"? Or some other workaround that could help with my duplicate content madness?
Much appreciated, you Excel Gurus you!
-
Choose one of the URLs as the authoritative version and remove the duplicated content from the others.
-
FMLLC,
I use Excel 2010 so my approach would be as follows:
-
Make a backup copy of your file before you start.
-
You will need to sort the values in each row left to right. Excel's sort dialog can only sort one range at a time, not each row independently, so you will need a macro.
-
Assuming your data starts in A1 and has no header row: put the code below in a general module (Alt+F11, then Insert => Module), go back to Excel, activate your sheet, and run the macro (Alt+F8, select SortEachRowHorizontal, then Run).
Sub SortEachRowHorizontal()
    Dim rng As Range, rw As Range
    ' Work on the contiguous block of data around A1
    Set rng = Range("A1").CurrentRegion
    ' Sort each row's cells left to right, so rows containing the
    ' same URLs in a different order become identical rows
    For Each rw In rng.Rows
        rw.Sort Key1:=rw.Cells(1), _
                Order1:=xlAscending, _
                Header:=xlNo, _
                OrderCustom:=1, _
                MatchCase:=False, _
                Orientation:=xlLeftToRight
    Next rw
End Sub
-
Then highlight all your cells and go to Data => Remove Duplicates.
The result should be all unique rows. I hope this helps.
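If you're comfortable stepping outside Excel for a minute, the same idea (sort each row's values, then drop rows that become identical) is only a few lines of script. Here's a minimal Python sketch using the hypothetical data from the question; to use it for real you'd read the rows in from your exported CSV first, so treat the hard-coded list as a placeholder:

```python
def dedupe_rows(rows):
    """Keep one row per unordered set of URLs.

    Two rows count as duplicates if they contain the same values
    in any order, e.g. [A, B, C] and [B, A, C].
    """
    seen = set()
    unique = []
    for row in rows:
        # Sorting a row's values gives it a canonical form, so
        # reordered copies of the same row produce the same key.
        key = tuple(sorted(row))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# The hypothetical data from the question: three rows, same URLs,
# different order. Only the first row should survive.
rows = [
    ["URL A", "URL B", "URL C"],
    ["URL B", "URL A", "URL C"],
    ["URL C", "URL A", "URL B"],
]
print(dedupe_rows(rows))  # [['URL A', 'URL B', 'URL C']]
```

For the real export, Python's built-in csv module can read the crawl file in and write the deduped rows back out.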
-