How to block "print" pages from indexing
-
I have a fairly large FAQ section and every article has a "print" button. Unfortunately, this is creating a page for every article which is muddying up the index - especially on my own site using Google Custom Search.
Can you recommend a way to block this from happening?
Example Article:
Example "Print" page:
http://www.knottyboy.com/lore/article.php?id=052&action=print
-
Donnie, I agree. However, we had the same problem on a website, and here's what we did: we used the canonical tag.
Over a period of 3-4 weeks, all those print pages disappeared from the SERP. Now if I take a print URL and do a cache: for that page, it shows me the web version of that page.
So yes, I agree the question was about blocking the pages from getting indexed. There's no real recipe here; it's about picking the right solution. Before the canonical tag, robots.txt was the only solution. But now that canonical exists (provided one has the time and resources to implement it, versus adding one line of text to robots.txt), you can effectively 301 the pages without having to stop or restrict the spiders from crawling them.
Absolutely no offence to your solution in any way. Both are indeed workable solutions. The best part is that your robots.txt solution takes 30 seconds to implement since you provided the actual disallow code :), so it's better.
-
Thanks Jennifer, will do! So much good information.
-
Sorry, but I have to jump in - do NOT use all of those signals simultaneously. You'll make a mess, and they'll interfere with each other. You can try Robots.txt or NOINDEX on the page level - my experience suggests NOINDEX is much more effective.
Also, do not nofollow the links yet - you'll block the crawl, and then the page-level cues (like NOINDEX) won't work. You can nofollow later. This is a common mistake and it will keep your fixes from working.
-
Josh, please read my and Dr. Pete's comments below. Don't nofollow the links, but do use the meta noindex,follow on the page.
-
Rel-canonical, in practice, does essentially de-index the non-canonical version. Technically, it's not a de-indexation method, but it works that way.
-
You are right Donnie. I've "good answered" you too.
I've gone ahead and updated my robots.txt file. As soon as I am able, I will use noindex on the page, nofollow on the links, and rel=canonical.
This is just what I needed, a quick fix until I can make a more permanent solution.
-
You're welcome : )
-
Although you are correct... there is still more than one way to skin a chicken.
-
But the spiders still crawl the page and read the canonical link; with robots.txt, the spiders will not.
-
Yes, but rel=canonical does not block a page; it only tells Google which of two pages to follow. The question was how to block, not how to tell Google which link to follow. I believe you gave credit to the wrong answer.
http://en.wikipedia.org/wiki/Canonical_link_element
This is not fair. lol
-
I have to agree with Jen - Robots.txt isn't great for getting indexed pages out. It's good for prevention, but tends to be unreliable as a cure. META NOINDEX is probably more reliable.
One trick - DON'T nofollow the print links, at least not yet. You need Google to crawl and read the NOINDEX tags. Once the ?print pages are de-indexed, you could nofollow the links, too.
-
Yes, it's strongly recommended. It should be fairly simple to populate this tag with the "full" URL of the article based on the article ID. This approach will not only help you get rid of the duplicate content issue, but a canonical tag also essentially works like a 301 redirect. So from the search engines' perspective, you are 301'ing your print pages to the real web URLs without redirecting the actual users who need to browse the print pages.
-
Ya it is actually really useful. Unfortunately they are out of business now - so I'm hacking it on my own.
I will take your advice. I've shamefully never used rel=canonical before, so now is a good time to start.
-
True, but using robots.txt does not keep them out of the index. Only using "noindex" will do that.
-
Thanks Donnie. Much appreciated!
-
I actually remember Lore from a while ago. It's an interesting, easy to use FAQ CMS.
Anyways, I would also recommend implementing canonical tags for any possible duplicate content issues. So whether it's the print or the web version, each one of them will contain a canonical tag pointing to the web URL of that article in the <head> section of the page.
<link rel="canonical" href="http://www.knottyboy.com/lore/idx.php/11/183/Maintenance-of-Mature-Locks-6-months-/article/How-do-I-get-sand-out-of-my-dreads.html" />
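If the FAQ templates are generated by a script, the tag could be built from the article ID along these lines. This is only a rough sketch in Python: the `ARTICLE_URLS` lookup table, its example.com URL, and the function name are hypothetical stand-ins for however your CMS maps an article ID to its full web URL.

```python
from html import escape

# Hypothetical lookup from article ID to the full "web" URL of the article.
# A real site would read this from the CMS database instead.
ARTICLE_URLS = {
    "052": "http://www.example.com/faq/article-052.html",
}

def canonical_link_tag(article_id: str) -> str:
    """Build the <link rel="canonical"> tag for an article's web URL."""
    url = ARTICLE_URLS[article_id]
    # Escape the URL so characters such as & are safe inside the href attribute.
    return '<link rel="canonical" href="%s" />' % escape(url, quote=True)

print(canonical_link_tag("052"))
```

The same tag would be printed in the head of both the web version and the print version, so both point at the one web URL.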
-
Try This.
User-agent: *
Disallow: /*&action=print
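To sanity-check which URLs a wildcard rule like this would cover, here is a small Python sketch. It is a simplified model of the wildcard matching Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end), not Google's actual matcher, and the function name is made up for illustration.

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path plus query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let every '*' stand for any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a robots.txt prefix match.
    return re.match(regex, path) is not None

RULE = "/*&action=print"
print(robots_pattern_matches(RULE, "/lore/article.php?id=052&action=print"))
print(robots_pattern_matches(RULE, "/lore/article.php?id=052"))
```

Under this model the rule covers the print URL from the question but not the plain article URL. Note that it would miss a URL where action=print is the first query parameter (`?action=print&id=052`), because the pattern requires the leading `&`.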
-
There's more than one way to skin a chicken.
-
Rather than using robots.txt, I'd add a noindex,follow meta tag to the page instead. The tag goes into the <head> of each print page, and it will ensure that the pages don't get indexed but that the links are followed.
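As a rough illustration, the page template could decide whether to print the tag by looking at the query string. The sketch below is in Python and assumes the print version is always requested with `action=print`, as in the example URL from the question; the function name is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# The standard robots meta tag: keep the page out of the index, still follow its links.
NOINDEX_FOLLOW = '<meta name="robots" content="noindex,follow">'

def robots_meta_for(url: str) -> str:
    """Return the robots meta tag for print pages, or an empty string otherwise."""
    query = parse_qs(urlsplit(url).query)
    # Assumption: print pages are exactly the requests with action=print.
    if query.get("action") == ["print"]:
        return NOINDEX_FOLLOW
    return ""

print(robots_meta_for("http://www.knottyboy.com/lore/article.php?id=052&action=print"))
```

The regular article pages get no tag, so they stay indexable. For the tag to be read at all, the print URLs must not be disallowed in robots.txt at the same time.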
-
That would be great. Do you mind giving me an example?
-
You can use robots.txt to block every page whose URL ends in action=print.