Can PDFs be seen as duplicate content? If so, how can we prevent it?
-
I see no reason why PDFs couldn't be considered duplicate content, but I haven't seen any threads about it.
We publish loads of product documentation provided by manufacturers, as well as White Papers and Case Studies. These give our customers and prospects a better idea of our solutions and help them along their buying process.
However, I'm not sure whether it would be better to make them non-indexable to prevent duplicate content issues. Clearly we would prefer a solution where we benefit from the keywords in the documents.
Does anyone have insight on how to deal with PDFs provided by third parties?
Thanks in advance.
-
It looks like Google is no longer crawling tabs, so if your PDFs are tabbed within pages, it might not be an issue: https://www.seroundtable.com/google-hidden-tab-content-seo-19489.html
-
Sure, I understand - thanks EGOL
-
I would like to give that to you but it is on a site that I don't share in forums. Sorry.
-
Thanks EGOL
That would be ideal.
For a site that has multiple authors and with it being impractical to get a developer involved every time a web page / blog post and the pdf are created, is there a single line of code that could be used to accomplish this in .htaccess?
If so, would you be able to show me an example please?
-
I assigned rel=canonical to my PDFs using .htaccess.
Then, if anyone links to the PDFs, the link value gets passed to the webpage.
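For a site with many authors, one way to avoid hand-editing is to generate the .htaccess stanzas from a list of PDF-to-page pairs. The sketch below is a hypothetical helper, not necessarily the setup described above: the function names, file names, and example.com URLs are placeholders, and it assumes Apache with mod_headers enabled. Each stanza uses the commonly seen `<Files>` plus `Header add Link` pattern to send a rel="canonical" HTTP header for one PDF.

```python
def canonical_htaccess_block(pdf_filename: str, canonical_url: str) -> str:
    """Return an .htaccess stanza that sends a rel="canonical" Link header for one PDF.

    Assumes Apache with mod_headers; the names passed in are placeholders.
    """
    return (
        f'<Files "{pdf_filename}">\n'
        f'  Header add Link "<{canonical_url}>; rel=\\"canonical\\""\n'
        f'</Files>\n'
    )


def build_htaccess(mapping: dict) -> str:
    """Join one stanza per PDF into text that can be pasted into .htaccess."""
    return "\n".join(
        canonical_htaccess_block(pdf, url) for pdf, url in mapping.items()
    )


# Hypothetical mapping of PDF file name -> the HTML page that should rank.
PDF_TO_PAGE = {
    "ultimate-guide.pdf": "https://www.example.com/ultimate-guide/",
    "white-paper.pdf": "https://www.example.com/blog/white-paper/",
}

if __name__ == "__main__":
    print(build_htaccess(PDF_TO_PAGE))
```

Authors would then only need to add a line to the mapping when they publish a new post and PDF pair, and someone regenerates the file.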
-
Hi all
I've been discussing the topic of making content available as both blog posts and pdf downloads today.
Given that there is a lot of uncertainty and complexity around this issue of potential duplication, my plan is to house all the PDFs in a folder that we block with robots.txt.
Anyone agree / disagree with this approach?
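If you go this route, it may be worth checking that the rule really covers the folder before relying on it. Below is a minimal sketch using Python's standard `urllib.robotparser`; the `/pdfs/` folder name, the example.com URLs, and the `is_blocked` helper are assumptions for illustration. Keep in mind that robots.txt stops crawling, so a blocked PDF that other sites link to can still show up as a bare URL in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a single PDF folder for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""


def is_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/pdfs/ultimate-guide.pdf",
        "https://www.example.com/blog/ultimate-guide/",
    ):
        print(url, "blocked" if is_blocked(ROBOTS_TXT, url) else "crawlable")
```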
-
Unfortunately, there's no great way to have it both ways. If you want these pages to get indexed for the links, then they're potential duplicates. If Google filters them out, the links probably won't count. Worst case, it could cause Panda-scale problems. Honestly, I suspect the link value is minimal and outweighed by the risk, but it depends quite a bit on the scope of what you're doing and the general link profile of the site.
-
I think you can set it to public or private (logged-in only) and even put a price tag on it if you want. So yes, setting it to private would help to eliminate the duplicate content issue, but it would also hide the links that I'm using to link-build.
I would imagine that, since this guide links back to our original site, it would be no different than if someone were to copy the content from our site and link back to us, thus crediting us as the original source. Especially if we make sure it is indexed through GWMT before submitting it to other platforms. Are there any good resources that delve into that?
-
Potentially, but I'm honestly not sure how Scribd's pages are indexed. Don't you need to log in or something to actually see the content on Scribd?
-
What about this instance:
(A) I made an "ultimate guide to X" and posted it on my site as individual HTML pages for each chapter
(B) I made a PDF version with the exact same content that people can download directly from the site
(C) I uploaded the PDF to sites like Scribd.com to help distribute it further, and build links with the links that are embedded in the PDF.
Would those all be dup content? Is (C) recommended or not?
-
Thanks! I am going to look into this. I'll let you know if I learn anything.
-
If they duplicate your main content, I think the header-level canonical may be a good way to go. For the syndication scenario, it's tough, because then you're knocking those PDFs out of the rankings, potentially, in favor of someone else's content.
Honestly, I've seen very few people deal with canonicalization for PDFs, and even those cases were small or obvious (like a page with the exact same content being outranked by the duplicate PDF). It's kind of uncharted territory.
-
Thanks for all of your input, Dr. Pete. The example that you use is almost exactly what I have: hundreds of PDFs on a fifty-page site. These PDFs rank well in the SERPs, accumulate PageRank, and pass traffic and link value back to the main site through links embedded within the PDFs. They also have natural links from other domains. I don't want to block them or nofollow them, but your suggestion of using the header directive sounds pretty good.
-
Oh, sorry - so these PDFs aren't duplicates with your own web/HTML content so much as duplicates with the same PDFs on other websites?
That's more like a syndication situation. It is possible that, if enough people post these PDFs, you could run into trouble, but I've never seen that. More likely, your versions just wouldn't rank. Theoretically, you could use the header-level canonical tag cross-domain, but I've honestly never seen that tested.
If you're talking about a handful of PDFs, they're a small percentage of your overall indexed content, and that content is unique, I wouldn't worry too much. If you're talking about 100s of PDFs on a 50-page website, then I'd control it. Unfortunately, at that point, you'd probably have to put the PDFs in a folder and outright block it. You'd remove the risk, but you'd stop ranking on those PDFs as well.
-
@EGOL: Can you expand a bit on your Author suggestion?
I was wondering if there is a way to do rel=author for a PDF document. I don't know how to do it and don't know if it is possible.
-
To make sure I understand what I'm reading:
- PDFs don't usually rank as well as regular pages (although it is possible)
- It is possible to configure a canonical tag on a PDF
My concern isn't that our PDFs may outrank the original content, but rather that we get slammed by Google for publishing them.
Am I right in thinking that a canonical tag prevents the PDF from accumulating link juice? If so, I would prefer not to use it, unless leaving it off gets us slammed by Google.
Has anyone experienced Google retribution for publishing PDFs that come from a third party?
@EGOL: Can you expand a bit on your Author suggestion?
Thanks all!
-
I think it's possible, but I've only seen it in cases that are a bit hard to disentangle. For example, I've seen a PDF outrank a duplicate piece of regular content when the regular content had other issues (including massive duplication with other, regular content). My gut feeling is that it's unusual.
If you're concerned about it, you can canonicalize PDFs with the header-level canonical directive. It's a bit more technically complex than the standard HTML canonical tag:
http://googlewebmastercentral.blogspot.com/2011/06/supporting-relcanonical-http-headers.html
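To check that a server is actually sending the header, you can read the `Link` response header of the PDF and pull out the canonical target. The helper below is a simplified sketch rather than a complete parser: `canonical_from_link_header` is a hypothetical name, the URLs are placeholders, and splitting on commas will misread URLs that themselves contain commas.

```python
import re
from typing import Optional


def canonical_from_link_header(link_header: str) -> Optional[str]:
    """Return the URL marked rel="canonical" in a Link header value, if any.

    Simplification: entries are split on commas, so URLs containing commas
    are not handled.
    """
    for part in link_header.split(","):
        match = re.match(r"\s*<([^>]+)>\s*;(.*)", part)
        if not match:
            continue
        url, params = match.group(1), match.group(2)
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params, re.IGNORECASE)
        # A rel value may list several space-separated relation types.
        if rel and "canonical" in rel.group(1).lower().split():
            return url
    return None


if __name__ == "__main__":
    header = '<https://www.example.com/white-paper/>; rel="canonical"'
    print(canonical_from_link_header(header))
```

In practice the header value would come from an HTTP HEAD or GET request against the PDF URL, for example via the browser's network panel or a command-line client.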
I'm going to mark this as "Discussion", just in case anyone else has seen real-world examples.
-
I am really interested in hearing what others have to say about this.
I know that PDFs can be very valuable content. They can be optimized, they rank in the SERPs, they accumulate PR, and they can pass link value. So, to me it would be a mistake to block them from the index...
However, I see your point about dupe content... they could also be thin content. Will panda whack you for thin and dupes in your PDFs?
How can canonical be used... what about author?
Anybody know anything about this?
-
Just like any other piece of duplicate content, you can use canonical link elements to specify the original piece of content (if there's indeed more than one identical piece). You could also block these types of files in the robots.txt, or use noindex-follow meta tags.
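One practical note on the noindex option: a PDF has no HTML head to hold a meta tag, so the equivalent for PDFs is the `X-Robots-Tag` HTTP response header. As a rough sketch, assuming Apache with mod_headers, the hypothetical helper below prints an .htaccess stanza that applies `noindex, follow` to every file ending in .pdf; the function name and defaults are illustrative only.

```python
def noindex_pdf_htaccess_block(directives: str = "noindex, follow") -> str:
    """Return an .htaccess stanza that sets X-Robots-Tag on all .pdf files.

    Assumes Apache with mod_headers enabled.
    """
    return (
        '<FilesMatch "\\.pdf$">\n'
        f'  Header set X-Robots-Tag "{directives}"\n'
        '</FilesMatch>\n'
    )


if __name__ == "__main__":
    print(noindex_pdf_htaccess_block())
```

The crawler has to fetch the PDF to see this header, so it should not be combined with a robots.txt block on the same files.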
Regards,
Margarita