Block in robots.txt instead of using canonical?
-
When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed?
In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?
-
With this info, I would go with robots.txt because, as you say, the crawl budget saved outweighs any potential loss, given the use of the pages and the absence of links.
Thanks
-
Thanks Robert.
The pages that I'm talking about disallowing do not have rank or links. They are sub-pages of a profile page. If anything, the main page will be linked to, not the sub-pages.
Maybe I should have explained that I'm talking about a large site - around 400K pages. More than 1,000 new pages are created per week. That's why I am concerned about managing crawl budget. The pages that I'm referring to are not linked to anywhere on the site. Sure, Google can potentially get to them if someone decides to link to them on their own site, but this is unlikely and certainly won't happen on a large scale. So I'm not really concerned about losing PageRank on the main profile page if I disallow them. To be clear: we have many thousands of pages with content that we want to rank. The pages I'm talking about are not important in those terms.
So it's really a question of balance... if these pages (there are MANY of them) are included in the crawl (and in our sitemap), potentially it's a real waste of crawl budget. Doesn't this outweigh the minuscule, far-fetched potential loss?
I understand that Google designed rel=canonical for this scenario, but that does not mean that it's necessarily the best way to go considering the other options.
-
Thanks Takeshi.
Maybe I should have explained that I'm talking about a large site - around 400K pages. More than 1,000 new pages are created per week. That's why I am concerned about managing crawl budget. The pages that I'm referring to are not linked to anywhere on the site. Sure, Google can potentially get to them if someone decides to link to them on their own site, but this is unlikely (since each one is a sub-page of the main profile page, which is where people would naturally link to) and certainly won't happen on a large scale. So I'm not really concerned about link-juice evaporation. According to AJ Kohn here, it's not enough to see in Webmaster Tools that Google has indexed all pages on our site. There is also the issue of how often pages are being crawled, which is what we are trying to optimize for.
So it's really a question of balance... if these pages (there are MANY of them) are included in the crawl (and in our sitemap), potentially it's a real waste of crawl budget. Doesn't this outweigh the minuscule, far-fetched potential loss?
Would love to hear your thoughts...
-
I would go with the canonicals. If there are any links going to these duplicate pages, that will prevent any "link juice evaporation" from links which Google can see but can't crawl due to robots.txt. Best to let Google just crawl the page and see the canonical so that it understands that it is a duplicate page.
Having canonicals on all your pages is good practice anyway, as it can prevent inadvertent duplicate content from things like query parameters.
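To illustrate the query-parameter case, here is a minimal Python sketch of how a page template might compute its canonical tag. It assumes that every query parameter only produces a variation of the same page, which will not hold for parameters that genuinely change the content; the function names and the example.com URL are made up for the example.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, host and path; drop the query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for the page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}" />'


print(canonical_link_tag("https://www.example.com/profile/123?sort=new&ref=email"))
# <link rel="canonical" href="https://www.example.com/profile/123" />
```

Every parameterised variation of the URL then declares the same clean URL as its canonical.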
Crawl budget can be of some concern if you're talking about a massive number of pages, but start by taking a look at Google Webmaster Tools and seeing how many of your pages are being crawled vs. the total number of pages on your site. As long as this ratio isn't small, you should be good. You can also get more crawl budget by building links to increase your domain authority.
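If you would rather measure this from your own server logs than from Webmaster Tools, something along these lines could work. It is only a rough Python sketch: it assumes the common "combined" access log format, trusts the user-agent string (which can be spoofed), and uses made-up sample lines and a made-up page total.

```python
import re
from collections import Counter

# Pulls the request path and the user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(log_lines):
    """Count requests per path whose user agent mentions Googlebot."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits


def crawl_ratio(hits, total_pages):
    """Share of the site's pages that Googlebot requested at least once."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return len(hits) / total_pages


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Hypothetical log excerpt: three Googlebot requests and one from a browser.
SAMPLE_LOG = [
    f'66.249.66.1 - - [01/Jan/2024:00:00:01 +0000] "GET /profile/1 HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [01/Jan/2024:00:00:02 +0000] "GET /profile/1/photos HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'203.0.113.7 - - [01/Jan/2024:00:00:03 +0000] "GET /profile/2 HTTP/1.1" 200 512 "-" "{BROWSER}"',
    f'66.249.66.1 - - [02/Jan/2024:00:00:01 +0000] "GET /profile/1 HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
]

hits = googlebot_hits(SAMPLE_LOG)
print(hits.most_common())    # paths ordered by how often Googlebot requested them
print(crawl_ratio(hits, 4))  # assuming the site has 4 pages in total
```

The per-path counts also show how often each page is crawled, not just whether it is crawled at all.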
-
I don't disagree at all and I think AJ Kohn is a rock star. In SEO, I have learned over time that there are rarely absolutes like always do this or never do that. I based my answer on how you posited the question.
If you read AJ's post you will note that the rel=canonical issue comes up with others commenting and not in the body of his post. Yes, if the page is superfluous like a cart page or a contact page, use the robots.txt to block the crawl. But, if you have a page with rank, links, etc. that help your canonical page, how are you helping yourself by forgoing rel=canon?
I think his bigger point was that you want to be aware and to understand that the # of times you are crawled is at least partially governed by PR which is governed by all those other things we discussed. If you understand that and keep the crawl focused on better pages you help yourself.
Does that clarify a bit?
Best
-
Hi, even if you use the robots.txt file to block these pages, Google can still pick up references to them from third-party websites and index the URLs without crawling them. Such pages will not have a description snippet in the search results and will instead show text that reads:
A description of this result is not available because of this site's robots.txt.
So, to keep these pages out of the search results altogether, you can use the page-level robots meta tag. It complements the robots.txt method, but note that Google has to crawl a page to see the tag, so a URL that relies on the meta tag should not also be disallowed in robots.txt. By the way, the robots.txt file can definitely save you some crawl budget. I don't think you should be thinking much about crawl budget, though, as long as your website is super easy to crawl, with simple text-based internal links, super-fast servers, and so on.
Those are my two cents, my friend.
Best regards,
Devanur Rafi
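
To check what a given set of robots.txt rules would actually block, Python's standard `urllib.robotparser` can be used. This is a minimal sketch: the `/profile/sub/` path and the example.com URLs are hypothetical stand-ins for the profile sub-pages discussed in this thread, and the check only says whether a compliant crawler may fetch a URL, not whether the URL can end up in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the profile sub-pages for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile/sub/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_crawlable("https://www.example.com/profile/123"))      # True
print(is_crawlable("https://www.example.com/profile/sub/123"))  # False
```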
-
Thanks for the response, Robert.
I have read lots of SEO advice on maximizing your "crawl budget" - making sure your internal link system is built well to send the bots to the right pages. According to my research, since bots only spend a certain amount of time on your site when they are crawling, it is important to do whatever you can to ensure that they don't "waste time" on pages that are not important for SEO. Just as one example, see this post from AJ Kohn.
Do you disagree with this whole approach?
-
Yair
I think that the canonical is the better option. I am unsure about your use of the term "crawl budget," in that there is no fixed number of times a page or a site will be crawled compared with, for example, a second similar site. I have a huge reference site that is crawled every couple of days, and I have small sites of ten pages that are crawled weekly or less. It depends on the traffic and the behavior of that traffic (which would include the number of inbound links, etc.) and on things like you resubmitting your sitemap.
The canonical tag was created to clarify for the search engine which page you consider to be the relevant one. Go ahead and use it.
Best
Robert