PDF on financial site that duplicates ~50% of site content
-
I have a financial advisor client who has a downloadable PDF on his site that contains about 9 pages of good info. Problem is much of the content can also be found on individual pages of his site.
Is it best to noindex/follow the pdf? It would be great to let the few pages of original content be crawlable, but I'm concerned about the duplicate content aspect.
Thanks --
-
This is what we have done with PDFs: assign rel="canonical" in .htaccess.
We did this with a few hundred files and it took Google a LONG time to find and credit them.
-
You could set the header to noindex rather than rel=canonical
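For the noindex route, here is a minimal sketch, assuming Apache with mod_headers enabled and .htaccess overrides allowed. X-Robots-Tag is the response header search engines read for non-HTML files. The `\.pdf$` pattern is an assumption that covers every PDF on the site, so narrow it to a single file name if only one PDF is affected.

```apache
# Requires mod_headers; can live in .htaccess or the virtual host config.
<FilesMatch "\.pdf$">
    # Keep matching PDFs out of the index, but let crawlers follow their links.
    Header set X-Robots-Tag "noindex, follow"
</FilesMatch>
```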
-
Personally I think it would be better not to index it, but if necessary, the index folder root seems like a good option.
-
Thanks. Anybody want to weigh in on where to rel=canonical to? Home page?
-
If you are using Apache, you can put it in your .htaccess in this form:
<FilesMatch "my-file.pdf">
Header set Link '<http://misite/my-file.html>; rel="canonical"'
</FilesMatch>
-
I think the right way here is to send the rel=canonical in the HTTP header for the PDF: http://googlewebmastercentral.blogspot.com/2011/06/supporting-relcanonical-http-headers.html
-
I thought the idea was to put rel=canonical on the duplicated page, to signal that "hey, this page may look like duplicate content, but please refer to this canonical URL"?
Looks like there is a PDF option for rel=canonical, so I guess the question is: what page on the site to make canonical?
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
Indicate the canonical version of a URL by responding with the Link rel="canonical" HTTP header. Adding rel="canonical" to the head section of a page is useful for HTML content, but it can't be used for PDFs and other file types indexed by Google Web Search. In these cases you can indicate a canonical URL by responding with the Link rel="canonical" HTTP header, like this (note that to use this option, you'll need to be able to configure your server):
Link: <http://www.example.com/downloads/white-paper.pdf>; rel="canonical"
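On Apache, one way to emit that header might look like the sketch below. It assumes mod_headers is enabled and reuses the placeholder URL from the quoted help text. To point the PDF at an HTML page instead, swap in that page's URL.

```apache
# Requires mod_headers. The URL is the placeholder from the quoted help text.
<Files "white-paper.pdf">
    # Send a Link header naming the canonical URL for this file.
    Header set Link "<http://www.example.com/downloads/white-paper.pdf>; rel=\"canonical\""
</Files>
```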
-
Hi Keith,
I'm sorry, I should have clarified. The rel=canonical tags would be on your Web pages, not the PDF (they are irrelevant in a PDF document). Then Google will attribute your Web page as the original source of the content and will understand that the PDF just contains bits of content from those pages. In this instance I would include a rel=canonical tag on every page of your site, just to cover your bases. Hope that helps!
Dana
-
Not sure which page I would mark as being canonical, since the pdf contains content from several different pages on the site. I don't think it's possible to assign different rel=canonical tags to separate portions of a pdf, is it?
-
As long as you have rel=canonical tags properly in place, you don't need to worry about the PDF causing duplicate content problems. That way, any original content should be picked up and any duplicate can be attributed to your existing Web pages. Hope that's helpful!
Dana