Robots.txt: Link Juice vs. Crawl Budget vs. Content 'Depth'
-
I run a quality vertical search engine. About 6 months ago we had a problem with our sitemaps, which resulted in most of our pages getting tossed out of Google's index. As part of the response, we put a bunch of robots.txt restrictions in place in our search results to prevent Google from crawling through pagination links and other parameter-based variants of our results (sort order, etc.). The idea was to 'preserve crawl budget' and so speed up the rate at which Google could get our millions of pages back in the index, by focusing its attention and resources on the right pages.
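For readers following along, here is a minimal sketch of how restrictions like these could be checked before deploying them. The rules and URLs are hypothetical (the thread does not show the real robots.txt), and path-style rules are used because Python's standard-library parser does plain prefix matching and does not implement the `*` wildcards that Googlebot understands.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, assumed for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/page/
Disallow: /search/sort/
"""

# Hypothetical URLs: one first-page result, two variants, one item page.
SAMPLE_URLS = [
    "https://example.com/search/widgets",
    "https://example.com/search/page/2/widgets",
    "https://example.com/search/sort/price/widgets",
    "https://example.com/item/12345",
]


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]


if __name__ == "__main__":
    for url in blocked_urls(ROBOTS_TXT, SAMPLE_URLS):
        print("blocked:", url)
```

Running a list of representative URLs through a check like this makes it easier to see whether a rule blocks more than intended, for example deep result pages that are the only path to some items.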
The pages are back in the index now (and have been for a while), and the restrictions have stayed in place since that time. But, in doing a little SEOMoz reading this morning, I came to wonder whether that approach may now be harming us...
http://www.seomoz.org/blog/restricting-robot-access-for-improved-seo
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions
Specifically, I'm concerned that a) we're blocking the flow of link juice and that b) by preventing Google from crawling the full depth of our search results (i.e. pages >1), we may be making our site wrongfully look 'thin'. With respect to b), we've been hit by Panda and have been implementing plenty of changes to improve engagement, eliminate inadvertently low-quality pages, etc., but we have yet to find 'the fix'...
Thoughts?
Kurus
-
I always advise people NOT to use robots.txt to block off pages - it isn't the best way to handle things. In your case, there are two options you can consider:
1. For variant pages (multiple parameters of the same page), use rel canonical to increase the strength of the original page and to keep the variants out of the index.
2. This one is controversial, and many may disagree, but it depends on the situation: allow crawling of the page but don't allow indexing ("noindex, follow"). This still passes any juice, but won't index pages that you don't want in the SERPs. I normally do this for search result pages that get indexed...
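The two options above can be sketched roughly as follows. This is only an illustration under assumptions: the parameter names in `VARIANT_PARAMS`, the `/search` path prefix, and the example URLs are hypothetical and not taken from the site in question.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names: the thread only mentions "sort order, etc."
VARIANT_PARAMS = {"sort", "order", "view"}
SEARCH_PREFIX = "/search"  # assumed location of search result pages


def canonical_url(url):
    """Option 1: drop presentation-only parameters so variants share one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VARIANT_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def robots_meta(url):
    """Option 2: search result pages stay crawlable but out of the index."""
    if urlsplit(url).path.startswith(SEARCH_PREFIX):
        return "noindex, follow"
    return "index, follow"


def head_tags(url):
    """Build the two <head> tags for a page at `url`."""
    return [
        f'<link rel="canonical" href="{escape(canonical_url(url))}">',
        f'<meta name="robots" content="{robots_meta(url)}">',
    ]


if __name__ == "__main__":
    for tag in head_tags("https://example.com/search?q=widgets&sort=price"):
        print(tag)
```

Note that neither tag has any effect on a URL that robots.txt still blocks, because the crawler never fetches the page and so never sees the tag.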
-
I got disconnected by SEOmoz as I posted, so here is the short answer:
You were affected by Panda, so you may have pages with almost no content. These pages may be the ones using up crawl budget, much more than the paginated results. Worry about these low-value pages and let Google handle the paginated results.
-
Baptiste,
Thanks for the feedback. Can you clarify what you mean by the following?
"On a side note, if you were impacted by Panda, I would strongly suggest to remove / disallow the empty pages on your site. This will give you more crawl budget for interesting content."
-
I would not dig too deep into the crawl budget and pagination problem: Google recognizes pagination and will increase the crawl budget when necessary. As for your site looking 'thin', I think you're right, and I would immediately allow pages > 1 to be indexed.
Be aware that this may or may not have a big impact on your site; it depends on the navigation system (you may have a lot of paginated subsets).
What do site: queries show? Are all your items submitted in your sitemaps and indexed (see WMT)?
On a side note, if you were impacted by Panda, I would strongly suggest to remove / disallow the empty pages on your site. This will give you more crawl budget for interesting content.