Can PDFs be seen as duplicate content? If so, how can it be prevented?
-
I see no reason why PDFs couldn't be considered duplicate content, but I haven't seen any threads about it.
We publish loads of product documentation provided by manufacturers, as well as White Papers and Case Studies. These give our customers and prospects a better idea of our solutions and help them along their buying process.
However, I'm not sure if it would be better to make them non-indexable to prevent duplicate content issues. Clearly we would prefer a solution where we benefit from the keywords in the documents.
Does anyone have insight on how to deal with PDFs provided by third parties?
Thanks in advance.
-
It looks like Google is not crawling tabs anymore, so if your PDFs are tabbed within pages, it might not be an issue: https://www.seroundtable.com/google-hidden-tab-content-seo-19489.html
-
Sure, I understand - thanks EGOL
-
I would like to give that to you but it is on a site that I don't share in forums. Sorry.
-
Thanks EGOL
That would be ideal.
For a site that has multiple authors, where it is impractical to get a developer involved every time a web page or blog post and its PDF are created, is there a single line of code that could be used to accomplish this in .htaccess?
If so, would you be able to show me an example please?
-
I assigned rel=canonical to my PDFs using .htaccess.
Then, if anyone links to the PDFs, the link value gets passed to the webpage.
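For anyone wondering what that header actually looks like, here is a minimal sketch in Python rather than .htaccess. It only builds the `Link` response header in the form Google documents for rel=canonical; the function names, the `PDF_TO_PAGE` mapping and the example.com URLs are all made up for illustration and are not the setup described above.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from PDF paths to the web pages they duplicate.
PDF_TO_PAGE: Dict[str, str] = {
    "/downloads/white-paper.pdf": "https://www.example.com/white-paper/",
}


def canonical_link_header(canonical_url: str) -> Tuple[str, str]:
    """Return the (name, value) pair for an HTTP rel=canonical header."""
    if not canonical_url.startswith(("http://", "https://")):
        raise ValueError("canonical URL should be absolute")
    return ("Link", '<{}>; rel="canonical"'.format(canonical_url))


def headers_for(path: str) -> List[Tuple[str, str]]:
    """Extra response headers for a requested path (empty if unmapped)."""
    page = PDF_TO_PAGE.get(path)
    return [canonical_link_header(page)] if page else []
```

Whatever serves the PDF (Apache, nginx, an application server) would need to attach that header to the PDF response itself, since a PDF has no HTML head to carry a canonical tag.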
-
Hi all
I've been discussing the topic of making content available as both blog posts and PDF downloads today.
Given that there is a lot of uncertainty and complexity around this issue of potential duplication, my plan is to house all the PDFs in a folder that we block with robots.txt.
Does anyone agree or disagree with this approach?
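As a rough way to sanity-check a rule like that before deploying it, here is a small sketch using Python's standard `urllib.robotparser`. The `/pdfs/` folder name, the example.com URLs and the `is_crawlable` helper are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every PDF lives under /pdfs/, which is disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Check a URL against the robots.txt rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these assumed rules, `is_crawlable("https://www.example.com/pdfs/guide.pdf")` is false while the blog post URL stays crawlable. Keep in mind that robots.txt only blocks crawling, so a crawler that obeys it will also never see any canonical header sent with those PDFs.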
-
Unfortunately, there's no great way to have it both ways. If you want these pages to get indexed for the links, then they're potential duplicates. If Google filters them out, the links probably won't count. Worst case, it could cause Panda-scale problems. Honestly, I suspect the link value is minimal and outweighed by the risk, but it depends quite a bit on the scope of what you're doing and the general link profile of the site.
-
I think you can set it to public or private (logged-in only) and even put a price tag on it if you want. So yes, setting it to private would help to eliminate the dup content issue, but it would also hide the links that I'm using to link-build.
I would imagine that, since this guide would link back to our original site, it would be no different than if someone were to copy the content from our site and link back to us with it, thus crediting us as the original source, especially if we make sure it is indexed through GWMT before submitting it to other platforms. Any good resources that delve into that?
-
Potentially, but I'm honestly not sure how Scribd's pages are indexed. Don't you need to log in or something to actually see the content on Scribd?
-
What about this instance:
(A) I made an "ultimate guide to X" and posted it on my site as individual HTML pages for each chapter
(B) I made a PDF version with the exact same content that people can download directly from the site
(C) I uploaded the PDF to sites like Scribd.com to help distribute it further, and build links with the links that are embedded in the PDF.
Would those all be dup content? Is (C) recommended or not?
-
Thanks! I am going to look into this. I'll let you know if I learn anything.
-
If they duplicate your main content, I think the header-level canonical may be a good way to go. For the syndication scenario, it's tough, because then you're knocking those PDFs out of the rankings, potentially, in favor of someone else's content.
Honestly, I've seen very few people deal with canonicalization for PDFs, and even those cases were small or obvious (like a page with the exact same content being outranked by the duplicate PDF). It's kind of uncharted territory.
-
Thanks for all of your input, Dr. Pete. The example that you use is almost exactly what I have: hundreds of .pdfs on a fifty-page site. These .pdfs rank well in the SERPs, accumulate PageRank, and pass traffic and link value back to the main site through links embedded within the .pdf. They also have natural links from other domains. I don't want to block them or nofollow them, but your suggestion of using the header directive sounds pretty good.
-
Oh, sorry - so these PDFs aren't duplicates with your own web/HTML content so much as duplicates with the same PDFs on other websites?
That's more like a syndication situation. It is possible that, if enough people post these PDFs, you could run into trouble, but I've never seen that. More likely, your versions just wouldn't rank. Theoretically, you could use the header-level canonical tag cross-domain, but I've honestly never seen that tested.
If you're talking about a handful of PDFs, they're a small percentage of your overall indexed content, and that content is unique, I wouldn't worry too much. If you're talking about 100s of PDFs on a 50-page website, then I'd control it. Unfortunately, at that point, you'd probably have to put the PDFs in a folder and outright block it. You'd remove the risk, but you'd stop ranking on those PDFs as well.
-
@EGOL: Can you expand a bit on your Author suggestion?
I was wondering if there is a way to do rel=author for a pdf document. I don't know how to do it and don't know if it is possible.
-
To make sure I understand what I'm reading:
- PDFs don't usually rank as well as regular pages (although it is possible)
- It is possible to configure a canonical tag on a PDF
My concern isn't that our PDFs may outrank the original content, but rather getting slammed by Google for publishing them.
Am I right in thinking a canonical tag prevents the PDF from accumulating link juice? If so, I would prefer not to use it, unless leaving it out leads to Google slamming us.
Has anyone experienced Google retribution for publishing PDFs coming from a third party?
@EGOL: Can you expand a bit on your Author suggestion?
Thanks all!
-
I think it's possible, but I've only seen it in cases that are a bit hard to disentangle. For example, I've seen a PDF outrank a duplicate piece of regular content when the regular content had other issues (including massive duplication with other, regular content). My gut feeling is that it's unusual.
If you're concerned about it, you can canonicalize PDFs with the header-level canonical directive. It's a bit more technically complex than the standard HTML canonical tag:
http://googlewebmastercentral.blogspot.com/2011/06/supporting-relcanonical-http-headers.html
I'm going to mark this as "Discussion", just in case anyone else has seen real-world examples.
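If you do set this up, it is worth confirming that the header really comes back with the PDF. Below is a minimal sketch of pulling the canonical target out of a `Link` header value; the function name and the simplified parsing are my own assumptions, and it is not a full implementation of the Link header grammar.

```python
import re
from typing import Optional

# Matches "<target>" followed by zero or more ";param" segments.
_LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def canonical_from_link_header(value: str) -> Optional[str]:
    """Return the rel="canonical" target from a Link header value, if any."""
    for target, params in _LINK_RE.findall(value):
        for param in params.split(";"):
            name, _, val = param.strip().partition("=")
            rels = val.strip().strip('"').lower().split()
            if name.strip().lower() == "rel" and "canonical" in rels:
                return target
    return None
```

You would feed it the `Link` header from the response to a request for the PDF (for example, one made with `curl -I`) and compare the result with the page you meant to point at.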
-
I am really interested in hearing what others have to say about this.
I know that .pdfs can be very valuable content. They can be optimized, they rank in the SERPs, they accumulate PR and they can pass link value. So, to me, it would be a mistake to block them from the index...
However, I see your point about dupe content... they could also be thin content. Will Panda whack you for thin and dupes in your PDFs?
How can canonical be used... what about author?
Anybody know anything about this?
-
Just like any other piece of duplicate content, you can use canonical link elements to specify the original piece of content (if there's indeed more than one identical piece). You could also block these types of files in the robots.txt, or use noindex-follow meta tags.
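One practical note on the noindex option: a PDF has no HTML head to hold a meta tag, so for PDFs the same directive would presumably be sent as an `X-Robots-Tag` response header instead. A minimal sketch, with a hypothetical helper name and a simple file-extension check as the assumed rule:

```python
from typing import List, Tuple


def robots_headers_for(path: str) -> List[Tuple[str, str]]:
    """Return an X-Robots-Tag header for PDF paths, nothing for other paths."""
    if path.lower().endswith(".pdf"):
        # "noindex, follow": keep the file out of the index but follow its links.
        return [("X-Robots-Tag", "noindex, follow")]
    return []
```

This only works if the PDFs stay crawlable, so it should not be combined with a robots.txt block on the same files.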
Regards,
Margarita