Can PDFs be seen as duplicate content? If so, how can it be prevented?
-
I see no reason why PDFs couldn't be considered duplicate content, but I haven't seen any threads about it.
We publish loads of product documentation provided by manufacturers, as well as White Papers and Case Studies. These give our customers and prospects a better idea of our solutions and help them along their buying process.
However, I'm not sure whether it would be better to make them non-indexable to prevent duplicate content issues. Clearly we would prefer a solution where we benefit from the keywords in the documents.
Does anyone have insight on how to deal with PDFs provided by third parties?
Thanks in advance.
-
It looks like Google is not crawling tabs anymore, so if your PDFs are tabbed within pages, it might not be an issue: https://www.seroundtable.com/google-hidden-tab-content-seo-19489.html
-
Sure, I understand - thanks EGOL
-
I would like to give that to you but it is on a site that I don't share in forums. Sorry.
-
Thanks EGOL
That would be ideal.
For a site with multiple authors, where it is impractical to involve a developer every time a web page / blog post and its PDF are created, is there a single line of code that could accomplish this in .htaccess?
If so, would you be able to show me an example please?
-
I assigned rel=canonical to my PDFs using htaccess.
Then, if anyone links to the PDFs, the link value gets passed to the webpage.
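For anyone wondering what that header looks like, here is a minimal sketch in Python of the response header a server would need to send with each PDF. The paths, the example.com URLs and the `CANONICAL_MAP` lookup are all made up for illustration; in .htaccess the same header would normally be set with Apache's mod_headers, and this is not necessarily the exact setup used above.

```python
# Hypothetical mapping from a PDF path to the web page that should get the credit.
CANONICAL_MAP = {
    "/downloads/white-paper.pdf": "https://www.example.com/white-paper.html",
}


def canonical_link_header(canonical_url: str) -> tuple[str, str]:
    """Build the (name, value) pair for a header-level rel=canonical."""
    return ("Link", f'<{canonical_url}>; rel="canonical"')


def extra_headers_for(path: str, canonical_map: dict[str, str]) -> list[tuple[str, str]]:
    """Return the extra response headers for a request path.

    Only PDFs that have an entry in the mapping get a canonical header;
    everything else is left untouched.
    """
    if path.lower().endswith(".pdf") and path in canonical_map:
        return [canonical_link_header(canonical_map[path])]
    return []
```

With a mapping like this, authors only add one line per new PDF instead of asking a developer to touch the server configuration each time.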
-
Hi all
I've been discussing the topic of making content available as both blog posts and pdf downloads today.
Given that there is a lot of uncertainty and complexity around this issue of potential duplication, my plan is to house all the PDFs in a folder that we block with robots.txt.
Anyone agree / disagree with this approach?
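If it helps, a plan like this can be sanity-checked before it goes live. The sketch below uses Python's standard `urllib.robotparser` against a sample robots.txt; the `/pdfs/` folder name and the example.com URLs are assumptions, not our real setup.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block one folder that holds all the PDFs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Calling `is_crawlable(ROBOTS_TXT, "https://www.example.com/pdfs/guide.pdf")` should come back false while the blog post URL stays crawlable. Keep in mind that a crawler that cannot fetch the PDF also cannot see any canonical header sent with it.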
-
Unfortunately, there's no great way to have it both ways. If you want these pages to get indexed for the links, then they're potential duplicates. If Google filters them out, the links probably won't count. Worst case, it could cause Panda-scale problems. Honestly, I suspect the link value is minimal and outweighed by the risk, but it depends quite a bit on the scope of what you're doing and the general link profile of the site.
-
I think you can set it to public or private (logged-in only) and even put a price tag on it if you want. So yes, setting it to private would help to eliminate the dup content issue, but it would also hide the links that I'm using to link-build.
I would imagine that since this guide would link back to our original site, it would be no different than if someone were to copy the content from our site and link back to us with it, thus crediting us as the original source. Especially if we make sure it is indexed through GWMT before submitting it to other platforms. Any good resources that delve into that?
-
Potentially, but I'm honestly not sure how Scribd's pages are indexed. Don't you need to log in or something to actually see the content on Scribd?
-
What about this instance:
(A) I made an "ultimate guide to X" and posted it on my site as individual HTML pages for each chapter
(B) I made a PDF version with the exact same content that people can download directly from the site
(C) I uploaded the PDF to sites like Scribd.com to help distribute it further, and build links with the links that are embedded in the PDF.
Would those all be dup content? Is (C) recommended or not?
-
Thanks! I am going to look into this. I'll let you know if I learn anything.
-
If they duplicate your main content, I think the header-level canonical may be a good way to go. For the syndication scenario, it's tough, because then you're knocking those PDFs out of the rankings, potentially, in favor of someone else's content.
Honestly, I've seen very few people deal with canonicalization for PDFs, and even those cases were small or obvious (like a page with the exact same content being outranked by the duplicate PDF). It's kind of uncharted territory.
-
Thanks for all of your input, Dr. Pete. The example that you use is almost exactly what I have: hundreds of .pdfs on a fifty-page site. These .pdfs rank well in the SERPs, accumulate PageRank, and pass traffic and link value back to the main site through links embedded within the .pdf. They also have natural links from other domains. I don't want to block them or nofollow them, but your suggestion of using the header directive sounds pretty good.
-
Oh, sorry - so these PDFs aren't duplicates with your own web/HTML content so much as duplicates with the same PDFs on other websites?
That's more like a syndication situation. It is possible that, if enough people post these PDFs, you could run into trouble, but I've never seen that. More likely, your versions just wouldn't rank. Theoretically, you could use the header-level canonical tag cross-domain, but I've honestly never seen that tested.
If you're talking about a handful of PDFs, they're a small percentage of your overall indexed content, and that content is unique, I wouldn't worry too much. If you're talking about 100s of PDFs on a 50-page website, then I'd control it. Unfortunately, at that point, you'd probably have to put the PDFs in a folder and outright block it. You'd remove the risk, but you'd stop ranking on those PDFs as well.
-
@EGOL: Can you expand a bit on your Author suggestion?
I was wondering if there is a way to do rel=author for a pdf document. I don't know how to do it and don't know if it is possible.
-
To make sure I understand what I'm reading:
- PDFs don't usually rank as well as regular pages (although it is possible)
- It is possible to configure a canonical tag on a PDF
My concern isn't that our PDFs may outrank the original content but rather getting slammed by Google for publishing them.
Am I right in thinking a canonical tag prevents the PDF from accumulating link juice? If so, I would prefer not to use it, unless leaving it off gets us slammed by Google.
Has anyone experienced Google retribution for publishing PDFs that come from a third party?
@EGOL: Can you expand a bit on your Author suggestion?
Thanks all!
-
I think it's possible, but I've only seen it in cases that are a bit hard to disentangle. For example, I've seen a PDF outrank a duplicate piece of regular content when the regular content had other issues (including massive duplication with other, regular content). My gut feeling is that it's unusual.
If you're concerned about it, you can canonicalize PDFs with the header-level canonical directive. It's a bit more technically complex than the standard HTML canonical tag:
http://googlewebmastercentral.blogspot.com/2011/06/supporting-relcanonical-http-headers.html
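To confirm that a PDF response really carries the header-level canonical, the `Link` header value can be inspected. Here is a rough Python sketch of that check; the function name and example URLs are hypothetical, and the parsing is deliberately simple (it assumes the URLs themselves contain no commas).

```python
import re
from typing import Optional


def canonical_from_link_header(value: str) -> Optional[str]:
    """Return the URL marked rel="canonical" in a Link header value, if any."""
    # Each entry looks like: <url>; rel="canonical"
    for match in re.finditer(r"<([^>]*)>([^,]*)", value):
        url, params = match.group(1), match.group(2)
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        if rel and "canonical" in rel.group(1).lower().split():
            return url
    return None
```

Feeding it the `Link` header returned for a PDF should give back the address of the HTML page you want to rank; `None` means no canonical was sent.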
I'm going to mark this as "Discussion", just in case anyone else has seen real-world examples.
-
I am really interested in hearing what others have to say about this.
I know that .pdfs can be very valuable content. They can be optimized, they rank in the SERPs, they accumulate PR and they can pass link value. So, to me it would be a mistake to block them from the index...
However, I see your point about dupe content... they could also be thin content. Will Panda whack you for thin and dupes in your PDFs?
How can canonical be used... what about author?
Anybody know anything about this?
-
Just like any other piece of duplicate content, you can use canonical link elements to specify the original piece of content (if there's indeed more than one identical piece). You could also block these types of files in the robots.txt, or use noindex-follow meta tags.
Regards,
Margarita
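One practical note on the noindex option: a PDF has no HTML head to hold a meta tag, so the equivalent instruction is usually sent as an `X-Robots-Tag` HTTP response header. A minimal Python sketch of that idea follows; the function name is made up, and how the header actually gets attached depends on your server.

```python
def robots_headers_for(path: str) -> list[tuple[str, str]]:
    """Return a noindex, follow header for PDFs and nothing for other files."""
    if path.lower().endswith(".pdf"):
        # Same meaning as a noindex,follow meta tag, but sent as an HTTP header.
        return [("X-Robots-Tag", "noindex, follow")]
    return []
```

The PDF must stay crawlable for this to work, so it should not be combined with a robots.txt block on the same files.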