How to block "print" pages from indexing
-
I have a fairly large FAQ section and every article has a "print" button. Unfortunately, this is creating a page for every article which is muddying up the index - especially on my own site using Google Custom Search.
Can you recommend a way to block this from happening?
Example Article:
Example "Print" page:
http://www.knottyboy.com/lore/article.php?id=052&action=print
-
Donnie, I agree. However, we had the same problem on a website, and here's what we did: we used the canonical tag.
Over a period of 3-4 weeks, all those print pages disappeared from the SERP. Now if I take a print URL and do a cache: for that page, it shows me the web version of that page.
So yes, I agree the question was about blocking the pages from getting indexed. There's no single recipe here; it's about choosing the right solution. Before the canonical tag, robots.txt was the only solution. But now with canonical available (provided one has the time and resources to implement it vs. adding one line of text to robots.txt), you can effectively 301 the pages as far as search engines are concerned, without having to stop or restrict the spiders from crawling them.
Absolutely no offence to your solution in any way. Both are indeed workable solutions. The best part is that your robots.txt solution takes 30 seconds to implement since you provided the actual disallow code :), so it's better.
-
Thanks Jennifer, will do! So much good information.
-
Sorry, but I have to jump in - do NOT use all of those signals simultaneously. You'll make a mess, and they'll interfere with each other. You can try Robots.txt or NOINDEX on the page level - my experience suggests NOINDEX is much more effective.
Also, do not nofollow the links yet - you'll block the crawl, and then the page-level cues (like NOINDEX) won't work. You can nofollow later. This is a common mistake and it will keep your fixes from working.
-
Josh, please read my and Dr. Pete's comments below. Don't nofollow the links, but do use the meta noindex,follow on the page.
-
Rel-canonical, in practice, does essentially de-index the non-canonical version. Technically, it's not a de-indexation method, but it works that way.
-
You are right Donnie. I've "good answered" you too.
I've gone ahead and updated my robots.txt file. As soon as I am able, I will use noindex on the page, nofollow on the links, and rel=canonical.
This is just what I needed, a quick fix until I can make a more permanent solution.
-
You're welcome :)
-
Although you are correct... there is still more than one way to skin a chicken.
-
But the spiders still crawl the page and read the canonical link; with robots.txt, they will not.
-
Yes, but rel=canonical does not block a page; it only tells Google which of two pages to follow. The question was how to block, not how to tell Google which link to follow. I believe you gave credit to the wrong answer.
http://en.wikipedia.org/wiki/Canonical_link_element
This is not fair. lol
-
I have to agree with Jen - Robots.txt isn't great for getting indexed pages out. It's good for prevention, but tends to be unreliable as a cure. META NOINDEX is probably more reliable.
One trick - DON'T nofollow the print links, at least not yet. You need Google to crawl and read the NOINDEX tags. Once the ?print pages are de-indexed, you could nofollow the links, too.
-
Yes, it's strongly recommended. It should be fairly simple to populate this tag with the "full" URL of the article based on the article ID. This approach will not only help you get rid of the duplicate content issue; a canonical tag also essentially works like a 301 redirect. So from the search engines' perspective, you are 301'ing your print pages to the real web URLs without redirecting the actual users who are browsing the print pages.
-
Ya it is actually really useful. Unfortunately they are out of business now - so I'm hacking it on my own.
I will take your advice. I've shamefully never used rel=canonical before - so now is a good time to start.
-
True, but using robots.txt does not keep them out of the index. Only using "noindex" will do that.
-
Thanks Donnie. Much appreciated!
-
I actually remember Lore from a while ago. It's an interesting, easy to use FAQ CMS.
Anyways, I would also recommend implementing canonical tags for any possible duplicate content issues. So whether it's the print or the web version, each one of them will contain a canonical tag pointing to the web URL of that article in the <head> section of your website.
<link rel="canonical" href="http://www.knottyboy.com/lore/idx.php/11/183/Maintenance-of-Mature-Locks-6-months-/article/How-do-I-get-sand-out-of-my-dreads.html" />
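If the FAQ templates are generated by a script, the tag could be built from the article's web URL along these lines. This is only a rough sketch: the helper name canonical_link_tag and the ARTICLE_URL constant are made up for illustration, and how Lore actually exposes the article URL is not something I know.

```python
from html import escape

# Example web URL of an article (taken from the tag quoted above).
ARTICLE_URL = (
    "http://www.knottyboy.com/lore/idx.php/11/183/"
    "Maintenance-of-Mature-Locks-6-months-/article/"
    "How-do-I-get-sand-out-of-my-dreads.html"
)


def canonical_link_tag(canonical_url: str) -> str:
    """Build the <link rel="canonical"> element for the page <head>.

    Hypothetical helper: the URL is HTML-escaped so characters such as
    '&' in a query string do not break the attribute.
    """
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}" />'


# Both the web version and the print version would emit the same tag.
print(canonical_link_tag(ARTICLE_URL))
```

The point is that the print template and the regular template call the same helper with the same web URL, so both versions point at one canonical page.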
-
Try This.
User-agent: *
Disallow: /*&action=print
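To sanity-check which URLs a rule like this would catch, here is a small sketch of the wildcard convention that the major crawlers document ('*' matches any run of characters, a trailing '$' anchors the end of the URL). The function rule_matches is my own illustration, not part of any library, and it is not a full robots.txt parser.

```python
import re


def rule_matches(pattern: str, path_and_query: str) -> bool:
    """Check a robots.txt path pattern against a URL path plus query string.

    Illustrative only: '*' matches any run of characters, a trailing '$'
    anchors the end, and everything else is matched literally from the
    start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path_and_query) is not None


rule = "/*&action=print"
print(rule_matches(rule, "/lore/article.php?id=052&action=print"))  # True
print(rule_matches(rule, "/lore/article.php?id=052"))               # False
print(rule_matches(rule, "/lore/article.php?action=print&id=052"))  # False
```

Note the last case: because the rule includes the leading "&", it would miss a print URL where action=print happens to be the first query parameter.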
-
There's more than one way to skin a chicken.
-
Rather than using robots.txt, I'd add a noindex,follow meta tag to the page instead. This code, <meta name="robots" content="noindex,follow">, goes into the <head> tag for each print page. It will ensure that the pages don't get indexed but that the links are followed.
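If the print view is rendered by the same script as the article, the tag can be emitted only when action=print is in the URL. A minimal sketch, assuming the print view is selected purely by that query parameter; robots_meta_for and NOINDEX_FOLLOW are hypothetical names, not anything from Lore.

```python
from urllib.parse import parse_qs, urlsplit

# Standard robots meta tag: keep the page out of the index, follow its links.
NOINDEX_FOLLOW = '<meta name="robots" content="noindex, follow" />'


def robots_meta_for(url: str) -> str:
    """Return the noindex,follow tag for print URLs, else an empty string.

    Assumes (hypothetically) that print pages are identified only by an
    action=print query parameter, wherever it appears in the query string.
    """
    query = parse_qs(urlsplit(url).query)
    is_print = "print" in query.get("action", [])
    return NOINDEX_FOLLOW if is_print else ""


print(robots_meta_for("http://www.knottyboy.com/lore/article.php?id=052&action=print"))
print(repr(robots_meta_for("http://www.knottyboy.com/lore/article.php?id=052")))
```

The regular article pages get nothing extra, so they stay indexable while the print versions drop out once they are recrawled.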
-
That would be great. Do you mind giving me an example?
-
You can block every page that ends in action=print in robots.txt.