Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Can PDF be seen as duplicate content? If so, how to prevent it?
- 
					
					
					
					
 I see no reason why PDF couldn't be considered duplicate content but I haven't seen any threads about it. We publish loads of product documentation provided by manufacturers as well as White Papers and Case Studies. These give our customers and prospects a better idea off our solutions and help them along their buying process. However, I'm not sure if it would be better to make them non-indexable to prevent duplicate content issues. Clearly we would prefer a solutions where we benefit from to keywords in the documents. Any one has insight on how to deal with PDF provided by third parties? Thanks in advance. 
- 
					
					
					
					
 It looks like Google is not crawling tabs anymore, therefore if your pdf's are tabbed within pages, it might not be an issue: https://www.seroundtable.com/google-hidden-tab-content-seo-19489.html 
- 
					
					
					
					
 Sure, I understand - thanks EGOL 
- 
					
					
					
					
 I would like to give that to you but it is on a site that I don't share in forums. Sorry. 
- 
					
					
					
					
 Thanks EGOL That would be ideal. For a site that has multiple authors and with it being impractical to get a developer involved every time a web page / blog post and the pdf are created, is there a single line of code that could be used to accomplish this in .htaccess? If so, would you be able to show me an example please? 
- 
					
					
					
					
 I assigned rel=canonical to my PDFs using htaccess. Then, if anyone links to the PDFs the linkvalue gets passed to the webpage. 
- 
					
					
					
					
 Hi all I've been discussing the topic of making content available as both blog posts and pdf downloads today. Given that there is a lot of uncertainty and complexity around this issue of potential duplication, my plan is to house all the pdfs in a folder that we block with robots.txt Anyone agree / disagree with this approach? 
- 
					
					
					
					
 Unfortunately, there's no great way to have it both ways. If you want these pages to get indexed for the links, then they're potential duplicates. If Google filters them out, the links probably won't count. Worst case, it could cause Panda-scale problems. Honestly, I suspect the link value is minimal and outweighed by the risk, but it depends quite a bit on the scope of what you're doing and the general link profile of the site. 
- 
					
					
					
					
 I think you can set it to public or private (logged-in only) and even put a price-tag on it if you want. So yes setting it to private would help to eliminate the dup content issue, but it would also hide the links that I'm using to link-build. I would imagine that since this guide would link back to our original site that it would be no different than if someone were to copy the content from our site and link back to us with it, thus crediting us as the original source. Especially if we ensure to index it through GWMT before submitting to other platforms. Any good resources that delve into that? 
- 
					
					
					
					
 Potentially, but I'm honestly not sure how Scrid's pages are indexed. Don't you need to log in or something to actually see the content on Scribd? 
- 
					
					
					
					
 What about this instance: (A) I made an "ultimate guide to X" and posted it on my site as individual HTML pages for each chapter (B) I made a PDF version with the exact same content that people can download directly from the site (C) I uploaded the PDF to sites like Scribd.com to help distribute it further, and build links with the links that are embedded in the PDF. Would those all be dup content? Is (C) recommended or not? 
- 
					
					
					
					
 Thanks!. I am going to look into this. I'll let you know if I learn anything. 
- 
					
					
					
					
 If they duplicate your main content, I think the header-level canonical may be a good way to go. For the syndication scenario, it's tough, because then you're knocking those PDFs out of the rankings, potentially, in favor of someone else's content. Honestly, I've seen very few people deal with canonicalization for PDFs, and even those cases were small or obvious (like a page with the exact same content being outranked by the duplicate PDF). It's kind of uncharted territory. 
- 
					
					
					
					
 Thanks for all of your input Dr. Pete. The example that you use is almost exactly what I have - hundreds of .pdfs on a fifty page site. These .pdfs rank well in the SERPs, accumulate pagerank, and pass traffic and link value back to the main site through links embedded within the .pdf. The also have natural links from other domains. I don't want to block them or nofollow them butyour suggestion of using header directive sounds pretty good. 
- 
					
					
					
					
 Oh, sorry - so these PDFs aren't duplicates with your own web/HTML content so much as duplicates with the same PDFs on other websites? That's more like a syndication situation. It is possible that, if enough people post these PDFs, you could run into trouble, but I've never seen that. More likely, your versions just wouldn't rank. Theoretically, you could use the header-level canonical tag cross-domain, but I've honestly never seen that tested. If you're talking about a handful of PDFs, they're a small percentage of your overall indexed content, and that content is unique, I wouldn't worry too much. If you're talking about 100s of PDFs on a 50-page website, then I'd control it. Unfortunately, at that point, you'd probably have to put the PDFs in a folder and outright block it. You'd remove the risk, but you'd stop ranking on those PDFs as well. 
- 
					
					
					
					
 @EGOL: Can you expend a bit on your Author suggestion? I was wondering if there is a way to do rel=author for a pdf document. I don't know how to do it and don't know if it is possible. 
- 
					
					
					
					
 To make sure I understand what I'm reading: - PDFs don't usually rank as well as regular pages (although it is possible)
- It is possible to configure a canonical tag on a PDF
 My concern isn't that our PDFs may outrank the original content but rather getting slammed by Google for publishing them. Am right in thinking a canonical tag prevents to accumulate link juice? If so I would prefer to not use it, unless it leads to Google slamming. Any one has experienced Google retribution for publishing PDF coming from a 3rd party? @EGOL: Can you expend a bit on your Author suggestion? Thanks all! 
- 
					
					
					
					
 I think it's possible, but I've only seen it in cases that are a bit hard to disentangle. For example, I've seen a PDF outrank a duplicate piece of regular content when the regular content had other issues (including massive duplication with other, regular content). My gut feeling is that it's unusual. If you're concerned about it, you can canonicalize PDFs with the header-level canonical directive. It's a bit more technically complex than the standard HTML canonical tag: http://googlewebmastercentral.blogspot.com/2011/06/supporting-relcanonical-http-headers.html I'm going to mark this as "Discussion", just in case anyone else has seen real-world examples. 
- 
					
					
					
					
 I am really interested in hearing what others have to say about this. I know that .pdfs can be very valuable content. They can be optimized, they rank in the SERPs, they accumulate PR and they can pass linkvalue. So, to me it would be a mistake to block them from the index... However, I see your point about dupe content... they could also be thin content. Will panda whack you for thin and dupes in your PDFs? How can canonical be used... what about author? Anybody know anything about this? 
- 
					
					
					
					
 Just like any other piece of duplicate content, you can use canonical link elements to specify the original piece of content (if there's indeed more than one identical piece). You could also block these types of files in the robots.txt, or use noindex-follow meta tags. Regards, Margarita 
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
- 
		
		Moz ToolsChat with the community about the Moz tools. 
- 
		
		SEO TacticsDiscuss the SEO process with fellow marketers 
- 
		
		CommunityDiscuss industry events, jobs, and news! 
- 
		
		Digital MarketingChat about tactics outside of SEO 
- 
		
		Research & TrendsDive into research and trends in the search industry. 
- 
		
		SupportConnect on product support and feature requests. 
Related Questions
- 
		
		
		
		
		
		Upper and lower case URLS coming up as duplicate content
 Hey guys and gals, I'm having a frustrating time with an issue. Our site has around 10 pages that are coming up as duplicate content/ duplicate title. I'm not sure what I can do to fix this. I was going to attempt to 301 direct the upper case to lower but I'm worried how this will affect our SEO. can anyone offer some insight on what I should be doing? Update: What I'm trying to figure out is what I should do for our URL's. For example, when I run an audit I'm getting two different pages: aaa.com/BusinessAgreement.com and also aaa.com/businessagreement.com. We don't have two pages but for some reason, Google thinks we do. Intermediate & Advanced SEO | | davidmac1
- 
		
		
		
		
		
		Duplicate content in Shopify - subsequent pages in collections
 Hello everyone! I hope an expert in this community can help me verify the canonical codes I'll add to our store is correct. Currently, in our Shopify store, the subsequent pages in the collections are not indexed by Google, however the canonical URL on these pages aren't pointing to the main collection page (page 1), e.g. The canonical URL of page 2, page 3 etc are used as canonical URLs instead of the first page of the collections. I have the canonical codes attached below, it would be much appreciated if an expert can urgently verify these codes are good to use and will solve the above issues? Thanks so much for your kind help in advance!! -----------------CODES BELOW--------------- <title><br /> {{ page_title }}{% if current_tags %} – tagged "{{ current_tags | join: ', ' }}"{% endif %}{% if current_page != 1 %} – Page {{ current_page }}{% endif %}{% unless page_title contains shop.name %} – {{ shop.name }}{% endunless %}<br /></title> Intermediate & Advanced SEO | | ycnetpro101
 {% if page_description %} {% endif %} {% if current_page != 1 %} {% else %} {% endif %}
 {% if template == 'collection' %}{% if collection %}
 {% if current_page == 1 %} {% endif %}
 {% if template == 'product' %}{% if product %} {% endif %}
 {% if template == 'collection' %}{% if collection %} {% endif %}0
- 
		
		
		
		
		
		SEM Rush & Duplicate content
 Hi SEMRush is flagging these pages as having duplicate content, but we have rel = next etc implemented: https://www.key.co.uk/en/key/brand/bott https://www.key.co.uk/en/key/brand/bott?page=2 Or is it being flagged as they're just really similar pages? Intermediate & Advanced SEO | | BeckyKey0
- 
		
		
		
		
		
		Directory with Duplicate content? what to do?
 Moz keeps finding loads of pages with duplicate content on my website. The problem is its a directory page to different locations. E.g if we were a clothes shop we would be listing our locations: www.sitename.com/locations/london www.sitename.com/locations/rome www.sitename.com/locations/germany The content on these pages is all the same, except for an embedded google map that shows the location of the place. The problem is that google thinks all these pages are duplicated content. Should i set a canonical link on every single page saying that www.sitename.com/locations/london is the main page? I don't know if i can use canonical links because the page content isn't identical because of the embedded map. Help would be appreciated. Thanks. Intermediate & Advanced SEO | | nchlondon0
- 
		
		
		
		
		
		Copying my Facebook content to website considered duplicate content?
 I write career advice on Facebook on a daily basis. On my homepage users can see the most recent 4-5 feeds (using FB social media plugin). I am thinking to create a page on my website where visitors can see all my previous FB feeds. Would this be considered duplicate content if I copy paste the info, but if I use a Facebook social media plugin then it is not considered duplicate content? I am working on increasing content on my website and feel incorporating FB feeds would make sense. thank you Intermediate & Advanced SEO | | knielsen0
- 
		
		
		
		
		
		Could you use a robots.txt file to disalow a duplicate content page from being crawled?
 A website has duplicate content pages to make it easier for users to find the information from a couple spots in the site navigation. Site owner would like to keep it this way without hurting SEO. I've thought of using the robots.txt file to disallow search engines from crawling one of the pages. Would you think this is a workable/acceptable solution? Intermediate & Advanced SEO | | gregelwell0
- 
		
		
		
		
		
		Duplicate content on ecommerce sites
 I just want to confirm something about duplicate content. On an eCommerce site, if the meta-titles, meta-descriptions and product descriptions are all unique, yet a big chunk at the bottom (featuring "why buy with us" etc) is copied across all product pages, would each page be penalised, or not indexed, for duplicate content? Does the whole page need to be a duplicate to be worried about this, or would this large chunk of text, bigger than the product description, have an effect on the page. If this would be a problem, what are some ways around it? Because the content is quite powerful, and is relavent to all products... Cheers, Intermediate & Advanced SEO | | Creode0
- 
		
		
		
		
		
		Concerns about duplicate content issues with australian and us version of website
 My company has an ecommerce website that's been online for about 5 years. The url is www.betterbraces.com. We're getting ready to launch an australian version of the website and the url will be www.betterbraces.com.au. The australian website will have the same look as the US website and will contain about 200 of the same products that are featured on the US website. The only major difference between the two websites is the price that is charged for the products. The australian website will be hosted on the same server as the US website. To ensure Australians don't purchase from the US site we are going to have a geo redirect in place that sends anyone with a AU ip address to the australian website. I am concerned that the australian website is going to have duplicate content issues. However, I'm not sure if the fact that the domains are so similar coupled with the redirect will help the search engines understand that these sites are related. I would appreciate any recommendations on how to handle this situation to ensure oue rankings in the search engines aren't penalized. Thanks in advance for your help. Alison French Intermediate & Advanced SEO | | djo-2836690
 
			
		 
			
		 
			
		 
			
		 
			
		 
			
		 
			
		 
					
				 
					
				 
					
				 
					
				 
					
				 
					
				 
					
				