Robots.txt: Link Juice vs. Crawl Budget vs. Content 'Depth'

kurus

I run a quality vertical search engine. About 6 months ago we had a problem with our sitemaps, which resulted in most of our pages getting tossed out of Google's index. As part of the response, we put a bunch of robots.txt restrictions in place in our search results to prevent Google from crawling through pagination links and other parameter based variants of our results (sort order, etc). The idea was to 'preserve crawl budget' in order to speed the rate at which Google could get our millions of pages back in the index by focusing attention/resources on the right pages.
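For readers who want to picture the kind of restriction described above, here is a minimal sketch of how Google-style robots.txt patterns (`*` as a wildcard, a trailing `$` as an end anchor) would match pagination and sort-order URLs. The rules and URL layout are hypothetical, since the post does not list the actual ones, and the matcher ignores `Allow` lines and rule precedence.

```python
import re

# Hypothetical Disallow rules; the original post does not show the real ones.
HYPOTHETICAL_DISALLOW = ["/search*page=", "/search*sort="]


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$', the rule is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow rule matches the path plus query string."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


if __name__ == "__main__":
    for path in ["/search?q=shoes", "/search?q=shoes&page=2", "/search?q=shoes&sort=price"]:
        print(path, "blocked" if is_blocked(path, HYPOTHETICAL_DISALLOW) else "crawlable")
```

Running a sample of real URLs through a check like this is one way to confirm exactly which pages a rule set hides from the crawler before deciding whether to lift it.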

The pages are back in the index now (and have been for a while), and the restrictions have stayed in place since that time. But, in doing a little SEOMoz reading this morning, I came to wonder whether that approach may now be harming us...

http://www.seomoz.org/blog/restricting-robot-access-for-improved-seo
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions

Specifically, I'm concerned that a) we're blocking the flow of link juice and that b) by preventing Google from crawling the full depth of our search results (i.e. pages >1), we may be making our site wrongfully look 'thin'. With respect to b), we've been hit by Panda and have been implementing plenty of changes to improve engagement, eliminate inadvertently low quality pages, etc, but we have yet to find 'the fix'...

Thoughts?

Kurus

rishil

I always advise people NOT to use robots.txt to block off pages - it isn't the best way to handle things. In your case, there are two options you can consider:

1. For variant pages (multiple parameter versions of the same page), use rel canonical to increase the strength of the original page and to keep the variants out of the index.

2. A controversial one, this, and many may disagree, but it depends on the situation - allow crawling of the page, but don't allow indexing: "noindex, follow", which would still pass any juice but won't index pages that you don't want in the SERPs. I normally do this for search result pages that get indexed...
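The two options above could be sketched as follows. This is only an illustration: the parameter names in `VARIANT_PARAMS` and the example.com URLs are assumptions, not details from the thread. Note that a "noindex, follow" tag can only be seen if the page is crawlable, so it does not work on URLs that robots.txt still blocks.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that only change presentation (sort order, etc.).
VARIANT_PARAMS = {"sort", "order", "view"}

# Option 2: let the crawler in and follow links, but keep the page out of the index.
NOINDEX_FOLLOW = '<meta name="robots" content="noindex, follow">'


def canonical_url(url: str) -> str:
    """Drop presentation-only parameters to get the 'original' page URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VARIANT_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url: str) -> str:
    """Option 1: the rel canonical tag a variant page would carry in its <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


if __name__ == "__main__":
    print(canonical_tag("https://example.com/search?q=shoes&page=2&sort=price"))
    print(NOINDEX_FOLLOW)
```

In this sketch the pagination parameter is deliberately left in the canonical URL, so page 2 is not declared a duplicate of page 1; whether that is right depends on the site.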

baptisteplace

Got disconnected by SEOmoz as I posted, so here is the short answer:

You were affected by Panda, so you may have pages with almost no content. These pages may be the ones using up crawl budget, much more than the paginated results. Worry about these low-value pages and let Google handle the paginated results.

kurus

Baptiste,

Thanks for the feedback. Can you clarify what you mean by the following?

"On a side note, if you were impacted by Panda, I would strongly suggest to remove / disallow the empty pages on your site. This will give you more crawl budget for interesting content."

baptisteplace

I would not dig too much into the crawl budget + pagination problem - Google knows what pagination is and will increase the crawl budget when necessary. On the 'thin' view of your site, I think you're right, and I would immediately allow pages > 1 to be indexed.

Beware that this may or may not have a big impact on your site; it depends on the navigation system (you may have a lot of paginated subsets).

What do site: queries show? Are all your items submitted in your sitemaps and indexed (see WMT)?

On a side note, if you were impacted by Panda, I would strongly suggest to remove / disallow the empty pages on your site. This will give you more crawl budget for interesting content.
