How to block "print" pages from indexing
-
I have a fairly large FAQ section and every article has a "print" button. Unfortunately, this is creating a page for every article which is muddying up the index - especially on my own site using Google Custom Search.
Can you recommend a way to block this from happening?
Example Article:
Example "Print" page:
http://www.knottyboy.com/lore/article.php?id=052&action=print
-
Donnie, I agree. However, we had the same problem on a website and here's what we did the canonical tag:
Over a period of 3-4 weeks, all those print pages disappeared from the SERP. Now if I take a print URL and do a cache: for that page, it shows me the web version of that page.
So yes, I agree the question was about blocking the pages from getting indexed. There's no real recipe here, it's about getting the right solution. Before canonical tag, robots.txt was the only solution. But now with canonical there (provided one has the time and resources available to implement it vs adding one line of text to robots.txt), you can technically 301 the pages and not have to stop/restrict the spiders from crawling them.
Absolutely no offence to your solution in any way. Both are indeed workable solutions. The best part is that your robots.txt solution takes 30 seconds to implement since you provided the actually disallow code :), so it's better.
-
Thanks Jennifer, will do! So much good information.
-
Sorry, but I have to jump in - do NOT use all of those signals simultaneously. You'll make a mess, and they'll interfere with each other. You can try Robots.txt or NOINDEX on the page level - my experience suggests NOINDEX is much more effective.
Also, do not nofollow the links yet - you'll block the crawl, and then the page-level cues (like NOINDEX) won't work. You can nofollow later. This is a common mistake and it will keep your fixes from working.
-
Josh, please read my and Dr. Pete's comments below. Don't nofollow the links, but do use the meta noindex,follow on the page.
-
Rel-canonical, in practice, does essentially de-index the non-canonical version. Technically, it's not a de-indexation method, but it works that way.
-
You are right Donnie. I've "good answered" you too.
I've gone ahead and updated my robots.txt file. As soon as I am able, I will use no indexon the page, no follow on the links, and rel=canonical.
This is just what I needed, a quick fix until I can make a more permanent solution.
-
Your welcome : )
-
Although you are correct... there is still more then one way to skin a chicken.
-
But the spiders still run on the page and read the canonical link, however with the robot text the spiders will not.
-
Yes, but Rel=Canonical does not block a page it only tells google which page to follow out of two pages.The question was how to block, not how to tell google which link to follow. I believe you gave credit to the wrong answer.
http://en.wikipedia.org/wiki/Canonical_link_element
This is not fair. lol
-
I have to agree with Jen - Robots.txt isn't great for getting indexed pages out. It's good for prevention, but tends to be unreliable as a cure. META NOINDEX is probably more reliable.
One trick - DON'T nofollow the print links, at least not yet. You need Google to crawl and read the NOINDEX tags. Once the ?print pages are de-indexed, you could nofollow the links, too.
-
Yes, it's strongly recommended. It should be fairly simple to populate this tag with the "full" URL of the article based on the article ID. This approach will not only help you get rid of the duplicate content issue, but a canonical tag essentially works like a 301 redirect. So from all search engine perspective you are 301'ing your print pages to the real web urls without redirecting the actual user's who are browsing the print pages if they need to.
-
Ya it is actually really useful. Unfortunately they are out of business now - so I'm hacking it on my own.
I will take your advice. I've shamefully never used rel= canonical before - so now is a good time to start.
-
True but using robots.txt does not keep them out of the index. Only using "noindex" will do that.
-
Thanks Donnie. Much appreciated!
-
I actually remember Lore from a while ago. It's an interesting, easy to use FAQ CMS.
Anyways, I would also recommend implementing Canonical Tags for any possible duplicate content issues. So whether it's the print or the web version, each one of them will contain a canonical tag pointing to the web url of that article in the section of your website.
rel="canonical" href="http://www.knottyboy.com/lore/idx.php/11/183/Maintenance-of-Mature-Locks-6-months-/article/How-do-I-get-sand-out-of-my-dreads.html" /> -
-
Try This.
User-agent: *
Disallow: /*&action=print
-
Theres more then one way to skin a chicken.
-
Rather than using robots.txt I'd use a noindex,follow tag instead to the page. This code goes into the tag for each print page. And it will ensure that the pages don't get indexed but that the links are followed.
-
That would be great. Do you mind giving me an example?
-
you can block in .robot text, every page that ends in action=print
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Getting 'Indexed, not submitted in sitemap' for around a third of my site. But these pages ARE in the sitemap we submitted.
As in the title, we have a site with around 40k pages, but around a third of them are showing as "Indexed, not submitted in sitemap" in Google Search Console. We've double-checked the sitemaps we have submitted and the URLs are definitely in the sitemap. Any idea why this might be happening? Example URL with the error: https://www.teacherstoyourhome.co.uk/german-tutor/Egham Sitemap it is located on: https://www.teacherstoyourhome.co.uk/sitemap-subject-locations-surrey.xml
Technical SEO | | TTYH0 -
How to explain "No Return Tags" Error from non-existing page?
In the Search Console of our Google Webmaster account we see 3 "no return tags" errors. The attached screenshot shows the detail of one of these errors. I know that annotations must be confirmed from the pages they are pointing to. If page A links to page B, page B must link back to page A, otherwise the annotations may not be interpreted correctly. However, the originating URL (/#!/public/tutorial/website/joomla) doesn't exist anymore. How could these errors still show up? Screenshot%202016-07-11%2017.36.27.png?dl=0
Technical SEO | | Maximuxxx0 -
New pages need to be crawled & indexed
Hi there, When you add pages to a site, do you need to re-generate an XML site map and re-submit to Google/Bing? I see the option in Google Webmaster Tools under the "fetch as Google tool" to submit individual pages for indexing, which I am doing right now. Thanks,
Technical SEO | | SSFCU
Sarah0 -
Home page not indexed by any search engines
We are currently having an issue with our homepage not being indexed by any search engines. We recently transferred our domain to Godaddy and there was an issue with the DNS. When we typed our url into Google like this "https://www.mysite.com" nothing from the site came up in the search results, only our social media profiles. When we typed our url into Google like this "mysite.com" we were sent to a GoDaddy parked page. We've been able to fix the issue over at Godaddy and the url "mysite.com" is not being redirected to "https://mysite.com" but, Google and the other search engines have yet to respond. I would say our fix has been in place for at least 72 hours. Do I need to give this more time? I would think that at lease one search engine would have picked up on the change by now and would start indexing the site properly.
Technical SEO | | bcglf1 -
Skip indexing the search pages
Hi, I want all such search pages skipped from indexing www.somesite.com/search/node/ So i have this in robots.txt (Disallow: /search/) Now any posts that start with search are being blocked and in Google i see this message A description for this result is not available because of this site's robots.txt – learn more. How can i handle this and also how can i find all URL's that Google is blocking from showing Thanks
Technical SEO | | mtthompsons0 -
Pages removed from Google index?
Hi All, I had around 2,300 pages in the google index until a week ago. The index removed a load and left me with 152 submitted, 152 indexed? I have just re-submitted my sitemap and will wait to see what happens. Any idea why it has done this? I have seen a drop in my rankings since. Thanks
Technical SEO | | TomLondon0 -
Huge number of indexed pages with no content
Hi, We have accidentally had Google indexed lots os our pages with no useful content at all on them. The site in question is a directory site, where we have tags and we have cities. Some cities have suppliers for almost all the tags, but there are lots of cities, where we have suppliers for only a handful of tags. The problem occured, when we created a page for each cities, where we list the tags as links. Unfortunately, our programmer listed all the tags, so not only the ones, where we have businesses, offering their services, but all of them! We have 3,142 cities and 542 tags. I guess, that you can imagine the problem this caused! Now I know, that Google might simply ignore these empty pages and not crawl them again, but when I check a city (city site:domain) with only 40 providers, I still have 1,050 pages indexed. (Yes, we have some issues between the 550 and the 1050 as well, but first things first:)) These pages might not be crawled again, but will be clicked, and bounces and the whole user experience in itself will be terrible. My idea is, that I might use meta noindex for all of these empty pages and perhaps also have a 301 redirect from all the empty category pages, directly to the main page of the given city. Can this work the way I imagine? Any better solution to cut this really bad nightmare short? Thank you in advance. Andras
Technical SEO | | Dilbak0 -
Hyphenated Domain Names - "Spammy" or Not?
Some say hyphenated domain names are "spammy". I have also noticed that Moz's On Page Keyword Tool does NOT recognize keywords in a non-hyphenated domain name. So one would assume neither do the bots. I noticed obviously misleading words like car in carnival or spa in space or spatula, etc embedded in domain names and pondered the effect. I took it a step further with non-hyphenated domain names. I experimented by selecting totally random three or four letter blocks - Example: randomfactgenerator.net - rand omf act gene rator Each one of those clips returns copious results AND the On-Page Report Card does not credit the domain name as containing "random facts" as keywords**,** whereas www.business-sales-sarasota.com does get credit for "business sales sarasota" in the URL. This seems an obvious situation - unhyphenated domains can scramble the keywords and confuse the bots, as they search all possible combinations. YES - I know the content should carry it but - I do not believe domain names are irrelevant, as many say. I don't believe that hyphenated domain names are not more efficient than non hyphenated ones - as long as you don't overdo it. I have also seen where a weak site in an easy market will quickly top the list because the hyphenated domain name matches the search term - I have done it (in my pre Seo Moz days) with ft-myers-auto-air.com. I built the site in a couple of days and in a couple weeks it was on page one. Any thoughts on this?
Technical SEO | | dcmike0