Excel tips or tricks for duplicate content madness?
-
Dearest SEO Friends,
I'm working on a site that has over 2,400 instances of duplicate content (yikes!).
I'm hoping somebody could offer some excel tips or tricks to managing my SEOMoz crawl diagnostics summary data file in a meaningful way, because right now this spreadsheet is not really helpful. Here's a hypothetical situation to describe why:
Say we had three columns of duplicate content. The data is displayed thusly:
|
Column A
|
Column B
|
Column C
URL A
|
URL B
|
URL C
|
In a perfect world, this is easy to understand. I want URL A to be the canonical. But unfortunately, the way my spreadsheet is populated, this ends up happening:
|
Column A
|
Column B
|
Column C
URL A
|
URL B
|
URL C
URL B
|
URL A
|
URL C
URL C
|
URL A
|
URL B
|
Essentially all of these URLs would end up being called a canonical, thus rendering the effect of the tag ineffective. On a site with small errors, this has never been a problem, because I can just spot check my steps. But the site I'm working on has thousands of instances, making it really hard to identify or even scale these patterns accurately.
This is particularly problematic as some of these URLs are identified as duplicates 50+ times! So my spreadsheet has well over 100K cells!!! Madness!!! Obviously, I can't go through manually. It would take me years to ensure the accuracy, and I'm assuming that's not really a scalable goal.
Here's what I would love, but I'm not getting my hopes up. Does anyone know of a formulaic way that Excel could identify row matches and think - "oh! these are all the same rows of data, just mismatched. I'll kill off duplicate rows, so only one truly unique row of data exists for this particular set" ? Or some other work around that could help me with my duplicate content madness?
Much appreciated, you Excel Gurus you!
-
Choose one of the URL's as the authoritive and remove the dupped content from the others.
-
FMLLC,
I use Excel 2010 so my approach would be as follows:
-
Make a backup copy of your file before you start.
-
You will need to sort each row by value, but Excel has a 3 sort level limit, so you will need to add a macro.
-
Assuming your data starts in A1 and has no header row, Put it in a general module, go back to excel, activate your sheet, then run the macro from Tools=>Macro=>Macros.
Sub SortEachRowHorizontal()
Dim rng As Range, rw As Range
Set rng = Range("A1").CurrentRegion
For Each rw In rng.Rows
rw.Sort Key1:=rw(1), _
order1:=xlAscending, _
Header:=xlNo, _
OrderCustom:=1, _
MatchCase:=False, _
Orientation:=xlLeftToRight
Next
End Sub
- Then Highlight all your cells and then go to Data -> Remove Duplicates
The result should be all unique rows. I hope this helps.
-
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How to deal with auto generated pages on our site that are considered thin content
Hi there, Wondering how to deal w/ about 300+ pages on our site that are autogenerated & considered thin content. Here is an example of those pages: https://app.cobalt.io/ninp0 The pages are auto generated when a new security researcher joins our team & then filled by each researcher with specifics about their personal experience. Additionally, there is a fair amount of dynamic content on these pages that updates with certain activities. These pages are also getting marked as not having a canonical tag on them, however, they are technically different pages just w/ very similar elements. I'm not sure I would want to put a canonical tag on them as some of them have a decent page authority & I think could be contributing to our overall SEO health. Any ideas on how I should deal w/ this group of similar but not identical pages?
Moz Pro | | ChrissyOck0 -
How to find those website who are using our content
I'm tring to figure it out that by using seo moz how can i find all website who are using our content.
Moz Pro | | Showhow20 -
Duplicate Content
Crawl Diagnostics is returning duplicate content/title tags for every product image on listing pages of my classified site because each image is on a separate url. So this page, for example, http://marketplace.myclassicgarage.com/cars/all/Chevrolet-Bel-Air/24481/ has, among other things, the same title tag as all this page, http://marketplace.myclassicgarage.com/cars/all/Chevrolet-Bel-Air/24481/media/151968 which is one of many different images that are all child pages in the folder /media In this particular case there are over 140 pages with the same title tag because there are over 140 images for this particular car. That is just one listing and there are over 1,000 listings (vehicles) and that number will grow. Is this really a problem? With limited resources, what real positive effect will making all these images have unique title tags really have from a SERP perspective? Keep in mind this being user generated content, there is no way to descriptively update the title tags to something like <title>Bel Air Passenger Side Profile</title>. That is not feasible.
Moz Pro | | MyClassicGarage0 -
Advice for 4000+ duplicate errors on 1st check
Hi, 1st time use of the SEOMOZ scan has thrown up a lot of duplicate errors. Seems to look like my site has a .com.au/ & .com.au/default for the same pages. We had the domain on a hosted cms solution & have now migrated to magento. We duplicated the pages, but had to redirect all of the old url's to he new magento structure. This was done via a developer adding a 301 wildcard code to the .htaccess. Would that many errors be normal for a 1st scan? Where should I look for someone to fix them? Thanks
Moz Pro | | Paul_MC0 -
Duplicate Page Content and Title - Miva - How to fix?
Hi, I'm new to SEOmoz and just diving into it. I'm feeling a bit overwhelmed. I use Miva Merchant as my storefront interface. SEMOz is returning a bunch of duplicate page content and duplicate page titles and I can't figure out what to do about it. It seems it may have something to do with Miva shortlinks. I click on the dup URL's in SEMOz and it brings me to a dead page. I can't figure out where it's coming from. I know without seeing the actual information it'll probably be tough to help me but any suggestions would be appreciated. I try to fix them and come to a point (after about three hours of getting nowhere) it becomes too frustrating. Thanks!
Moz Pro | | musicforkids
Gary0 -
Duplicate page content showing up with proper use of canonical tag
Hi, In the Crawl diagnostics reports, I'm getting lots of duplicate errors warnings e.g. duplicate page title. In most cases these are tracking urls and the page has a canonical tag pointing to the original page. It would be helpful if the crawl analysis reports could separate these out from ones that are of genuine concern. It can also happen when there's a noindex tag on a page. Thanks, Leigh
Moz Pro | | Leighm0 -
Port 80 and Duplicate Content
The SEOmoz Web App is showing me that every single URL on one of my clients' domains has a duplicate in the form of the URL + :80. For instance, the app is showing me that www.example.com/default.aspx is duplicated in the form of www.example.com:80/default.aspx Any idea if this is an actual problem or just some kind of reporting error? Any help would be appreciated.
Moz Pro | | AnthonyMangia0 -
How to remove Duplicate content due to url parameters from SEOMoz Crawl Diagnostics
Hello all I'm currently getting back over 8000 crawl errors for duplicate content pages . Its a joomla site with virtuemart and 95% of the errors are for parameters in the url that the customer can use to filter products. Google is handling them fine under webmaster tools parameters but its pretty hard to find the other duplicate content issues in SEOMoz with all of these in the way. All of the problem parameters start with ?product_type_ Should i try and use the robot.txt to stop them from being crawled and if so what would be the best way to include them in the robot.txt Any help greatly appreciated.
Moz Pro | | dfeg0