API for testing duplicate content
-
Does anyone know a service or API or php lib to compare two (or more) pages and to return their similiarity (Level-3-Shingles).
API would be greatly prefered.
-
Hey Erica,
thanks for your answer. What I need is a way to decide on-the-fly whether two pages are similar or not. If they are too similar I need to depublish or at least rel canonical one of those.
Best solution would be an API that takes 2 pages, but it seems as if I have to build it myself then.
Thanks for your efforts.
-
While I don't know of an API that does that, you can set up your site using the SEOmoz tools and our Crawl Diagnostics section does look for Duplicate Cotent.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
'duplicate content' on several different pages
Hi, I've a website with 6 pages identified as 'duplicate content' because they are very similar. This pages looks similar because are the same but it show some pictures, a few, about the product category that's why every page look alike each to each other but they are not 'exactly' the same. So, it's any way to indicate to Google that the content is not duplicated? I guess it's been marked as duplicate because the code is 90% or more the same on 6 pages. I've been reviewing the 'canonical' method but I think is not appropriated here as the content is not the same. Any advice (that is not add more content)?
Technical SEO | | jcobo0 -
Duplicate content and rel canonicals?
Hi. I have a question relating to 2 sites that I manage with regards to duplicate content. These are 2 separate companies but the content is off a data base from the one(in other words the same). In terms of the rel canonical, how would we do this so that google does not penalise either site but can also have the content to crawl for both or is this just a dream?
Technical SEO | | ProsperoDigital0 -
Database driven content producing false duplicate content errors
How do I stop the Moz crawler from creating false duplicate content errors. I have yet to submit my website to google crawler because I am waiting to fix all my site optimization issues. Example: contactus.aspx?propid=200, contactus.aspx?propid=201.... these are the same pages but with some old url parameters stuck on them. How do I get Moz and Google not to consider these duplicates. I have looked at http://moz.com/learn/seo/duplicate-content with respect to Rel="canonical" and I think I am just confused. Nick
Technical SEO | | nickcargill0 -
Drupal SEO help - Duplicate content but very similar URLS?
Hi, This is a very strange problem and not sure how it has happened. I am adding packages to my website and a duplicate page & almost identical URL is being picked up by Google. E.g. the page I make is http://www.ukgirlthing.co.uk/hen-party/bristol-spa-rty-lunch-pampering-h... but then also appearing is http://www.ukgirlthing.co.uk/hen-party/bristol-spa-rty-pampering-hen-party. The node's are exactly the same, and if i edit one of them, the other also updates. You will notice that the URL's are almost exactly the same, except the words are re-organised slightly? Shall i just delete the URL alias of the duplicate entry or is there something else which is making this happen? These URL's are being picked up as duplicate content, although it's the same node! Hope you can help, Thank you!
Technical SEO | | Party_Experts0 -
Duplicated content in news portal: should we use noindex?
Hello, We have a news portal, and like other newspapers we have our own content and content from other contributors. Both our content and our contributors content can be found in other websites (we sell our content and they give theirs to us). In this regard, everything seems to work fine from the business and users perspective. The problem is that this means duplicated content... so my question is: "Should we add the noindex,nofollow" tag to these articles? Notice that there might be hundreds of articles everyday, something like a 1/3 of the website. I checked one newspaper which uses news from agencies, but they seem not to use any noindex tag. Not sure what others do. I would appreciate any opinion on that.
Technical SEO | | forex-websites0 -
Development Website Duplicate Content Issue
Hi, We launched a client's website around 7th January 2013 (http://rollerbannerscheap.co.uk), we originally constructed the website on a development domain (http://dev.rollerbannerscheap.co.uk) which was active for around 6-8 months (the dev site was unblocked from search engines for the first 3-4 months, but then blocked again) before we migrated dev --> live. In late Jan 2013 changed the robots.txt file to allow search engines to index the website. A week later I accidentally logged into the DEV website and also changed the robots.txt file to allow the search engines to index it. This obviously caused a duplicate content issue as both sites were identical. I realised what I had done a couple of days later and blocked the dev site from the search engines with the robots.txt file. Most of the pages from the dev site had been de-indexed from Google apart from 3, the home page (dev.rollerbannerscheap.co.uk, and two blog pages). The live site has 184 pages indexed in Google. So I thought the last 3 dev pages would disappear after a few weeks. I checked back late February and the 3 dev site pages were still indexed in Google. I decided to 301 redirect the dev site to the live site to tell Google to rank the live site and to ignore the dev site content. I also checked the robots.txt file on the dev site and this was blocking search engines too. But still the dev site is being found in Google wherever the live site should be found. When I do find the dev site in Google it displays this; Roller Banners Cheap » admin <cite>dev.rollerbannerscheap.co.uk/</cite><a id="srsl_0" class="pplsrsla" tabindex="0" data-ved="0CEQQ5hkwAA" data-url="http://dev.rollerbannerscheap.co.uk/" data-title="Roller Banners Cheap » admin" data-sli="srsl_0" data-ci="srslc_0" data-vli="srslcl_0" data-slg="webres"></a>A description for this result is not available because of this site's robots.txt – learn more.This is really affecting our clients SEO plan and we can't seem to remove the dev site or rank the live site in Google.Please can anyone help?
Technical SEO | | SO_UK0 -
Duplicate video content question
This is really two questions in one. 1. If we put a video on YouTube and on our site via Wistia, how would that affect our rankings/authority/credibility? Would we get punished for duplicate video content? 2. If we put a Wistia hosted video on our website twice, on two different pages, we would get hit for having duplicate content? Any other suggestions regarding hosting on Wistia and YouTube versus just Wistia for product videos would be much appreciated. Thank you!
Technical SEO | | ShawnHerrick1 -
Whats with the backslash in the url adding as duplicate content?
Is this a bug or something that needs to be addressed? If so, just use a redirect?
Technical SEO | | Boogily0