PDF on financial site that duplicates ~50% of site content
-
I have a financial advisor client with a downloadable PDF on his site that contains about 9 pages of good info. The problem is that much of that content can also be found on individual pages of his site.
Is it best to noindex/follow the PDF? It would be great to keep the few pages of original content crawlable, but I'm concerned about the duplicate content aspect.
Thanks --
-
This is what we have done with PDFs: assign rel="canonical" via an HTTP header in .htaccess.
We did this with a few hundred files and it took Google a LONG time to find and credit them.
-
You could set a noindex directive in the HTTP header (X-Robots-Tag) rather than rel=canonical.
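If the site runs on Apache with mod_headers enabled, a rough sketch of that approach in .htaccess might look like this (the match pattern is just a placeholder covering all PDFs):
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, follow"
</FilesMatch>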
-
Personally I think it would be better not to index it, but if you do need a canonical target, the index page at the site root seems like a good option.
-
Thanks. Anybody want to weigh in on where to point the rel=canonical? The home page?
-
If you are using Apache, you can put it in your .htaccess in this form (mod_headers needs to be enabled):
<FilesMatch "my-file.pdf">
Header set Link '<http://misite.com/my-file.html>; rel="canonical"'
</FilesMatch>
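A quick way to sanity-check that the header is actually being sent (keeping the same placeholder URLs as above) is to request just the response headers and look for the Link line:
curl -I http://misite.com/my-file.pdf
The output should include something like:
Link: <http://misite.com/my-file.html>; rel="canonical"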
-
I think the right way here is to put the rel=canonical in the HTTP header served with the PDF: http://googlewebmastercentral.blogspot.com/2011/06/supporting-relcanonical-http-headers.html
-
I thought the idea was to put rel=canonical on the duplicated page, to signal that "hey, this page may look like duplicate content, but please refer to this canonical URL"?
Looks like there is a PDF option for rel=canonical. I guess the question is: which page on the site to make canonical?
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
Indicate the canonical version of a URL by responding with the Link rel="canonical" HTTP header. Adding rel="canonical" to the head section of a page is useful for HTML content, but it can't be used for PDFs and other file types indexed by Google Web Search. In these cases you can indicate a canonical URL by responding with the Link rel="canonical" HTTP header, like this (note that to use this option, you'll need to be able to configure your server):
Link: <http://www.example.com/downloads/white-paper.pdf>; rel="canonical"
-
Hi Keith,
I'm sorry, I should have clarified. The rel=canonical tags would be on your Web pages, not the PDF (they are irrelevant in a PDF document). Then Google will attribute your Web page as the original source of the content and will understand that the PDF just contains bits of content from those pages. In this instance I would include a rel=canonical tag on every page of your site, just to cover your bases. Hope that helps!
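For example (the URL below is just a placeholder standing in for each page's own address), a self-referencing canonical in the head section of a page would look like:
<link rel="canonical" href="http://www.example.com/your-page.html" />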
Dana
-
Not sure which page I would mark as canonical, since the PDF contains content from several different pages on the site. I don't think it's possible to assign different rel=canonical tags to separate portions of a PDF, is it?
-
As long as you have rel=canonical tags properly in place, you don't need to worry about the PDF causing duplicate content problems. That way, any original content should be picked up and any duplicate can be attributed to your existing Web pages. Hope that's helpful!
Dana