Robots.txt: Link Juice vs. Crawl Budget vs. Content 'Depth'
-
I run a quality vertical search engine. About 6 months ago we had a problem with our sitemaps, which resulted in most of our pages getting tossed out of Google's index. As part of the response, we put a bunch of robots.txt restrictions in place on our search results to prevent Google from crawling through pagination links and other parameter-based variants of our results (sort order, etc.). The idea was to 'preserve crawl budget' and speed up the rate at which Google could get our millions of pages back in the index by focusing its attention and resources on the right pages.
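As a rough illustration of the kind of restriction described above: the actual robots.txt rules were not posted, so the patterns and URLs below are hypothetical. This minimal sketch uses a simplified matcher (prefix match with the `*` wildcard and optional trailing `$`, and no Allow or longest-match handling) to show which URLs such rules would keep a crawler away from.

```python
import re

# Hypothetical Disallow patterns; the real file was not shared in the thread.
DISALLOW_PATTERNS = [
    "/search*page=",  # pagination links (pages > 1)
    "/search*sort=",  # sort-order variants
]


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path_and_query):
    """True if any Disallow pattern matches from the start of the path."""
    return any(
        pattern_to_regex(p).match(path_and_query) for p in DISALLOW_PATTERNS
    )


if __name__ == "__main__":
    # Hypothetical URLs (path plus query string only).
    for url in [
        "/search?q=plumbers",
        "/search?q=plumbers&page=2",
        "/search?q=plumbers&sort=price",
        "/listing/12345",
    ]:
        print(url, "->", "blocked" if is_blocked(url) else "crawlable")
```

With rules like these, page 1 of a result set and the item pages stay crawlable, while everything reachable only through page 2 onward depends on sitemaps or other links to be discovered.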
The pages are back in the index now (and have been for a while), and the restrictions have stayed in place since that time. But, in doing a little SEOMoz reading this morning, I came to wonder whether that approach may now be harming us...
http://www.seomoz.org/blog/restricting-robot-access-for-improved-seo
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions
Specifically, I'm concerned that a) we're blocking the flow of link juice and that b) by preventing Google from crawling the full depth of our search results (i.e. pages >1), we may be making our site wrongfully look 'thin'. With respect to b), we've been hit by Panda and have been implementing plenty of changes to improve engagement, eliminate inadvertently low quality pages, etc, but we have yet to find 'the fix'...
Thoughts?
Kurus
-
I always advise people NOT to use robots.txt to block off pages; it isn't the best way to handle things. In your case, there are two options you can consider:
1. For variant pages (multiple parameter versions of the same page), use rel canonical to increase the strength of the original page and to keep the variants out of the index.
2. This one is controversial, and many may disagree, but depending on the situation: allow crawling of the page but don't allow indexing (a "noindex, follow" robots meta tag). This still passes any link juice, but keeps pages that you don't want in the SERPs out of the index. I normally do this for search result pages that get indexed...
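The two options above could look roughly like this in a page template. This is only a sketch: the parameter names in `VARIANT_PARAMS`, the helper names, and the example.com URL are assumptions for illustration, not anything from the site in question.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed for illustration: parameters that change presentation, not content.
VARIANT_PARAMS = {"sort", "order", "view"}


def canonical_url(url):
    """Drop presentation-only parameters to get the 'original' page URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VARIANT_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def canonical_tag(url):
    """Option 1: point a variant page at its canonical version."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


def noindex_follow_tag():
    """Option 2: let the page be crawled and its links followed, not indexed."""
    return '<meta name="robots" content="noindex, follow">'


if __name__ == "__main__":
    print(canonical_tag("https://example.com/search?q=plumbers&sort=price"))
    print(noindex_follow_tag())
```

Either tag only has an effect if the crawler can fetch the page. A robots.txt Disallow on the same URL keeps Google from ever seeing the tag, which is one reason the two approaches don't mix well.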
-
I got disconnected by SEOmoz as I posted, so here is the short answer:
You were affected by Panda, so you may have pages with almost no content. These pages may be the ones using up crawl budget, much more than the paginated results. Worry about these low-value pages and let Google handle the paginated results.
-
Baptiste,
Thanks for the feedback. Can you clarify what you mean by the following?
"On a side note, if you were impacted by Panda, I would strongly suggest to remove / disallow the empty pages on your site. This will give you more crawl budget for interesting content."
-
I would not dig too deep into the crawl budget + pagination problem: Google recognizes pagination and will increase the crawl budget when necessary. On the 'thin' view of your site, I think you're right, and I would immediately allow pages > 1 to be indexed.
Be aware that this may or may not have a big impact on your site; it depends on the navigation system (you may have a lot of paginated subsets).
What do site: queries show? Are all your items submitted in your sitemaps and indexed (see WMT)?
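One way to answer the sitemap half of that question is to compare the list of item URLs you expect against what the sitemap actually contains. Here is a minimal sketch, with a made-up inline sitemap and made-up item URLs so it runs as-is; the indexed counts themselves would still have to come from WMT.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, inlined for illustration.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/listing/1</loc></url>
  <url><loc>https://example.com/listing/2</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_sitemap(item_urls, xml_text):
    """Item URLs that should be in the sitemap but are not."""
    return sorted(set(item_urls) - sitemap_urls(xml_text))


if __name__ == "__main__":
    # Hypothetical list of item pages that should all be submitted.
    items = [
        "https://example.com/listing/1",
        "https://example.com/listing/2",
        "https://example.com/listing/3",
    ]
    print(missing_from_sitemap(items, SITEMAP_XML))
```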
On a side note, if you were impacted by Panda, I would strongly suggest to remove / disallow the empty pages on your site. This will give you more crawl budget for interesting content.