Google showing high volume of URLs blocked by robots.txt in index: should we be concerned?
-
If we search site:domain.com vs www.domain.com, we see 130,000 vs 15,000 results. When reviewing the site:domain.com results, we find that the majority of the URLs shown are blocked by robots.txt. They are subdomains that we use as production environments (and they contain content similar to the rest of our site).
We also see the message "In order to show you the most relevant results, we have omitted some entries very similar to the 541 already displayed." SEER Interactive mentions that this is one way to gauge a Panda penalty: http://www.seerinteractive.com/blog/100-panda-recovery-what-we-learned-to-identify-issues-get-your-traffic-back
We were hit by Panda some time back. Is this an issue we should address? Should we unblock the subdomains and add noindex, follow?
-
I think it's worth it. I'm not sure what CMS you're using, but it shouldn't take much time to add noindex,follow to the header of all your pages, and then remove the robots.txt directive that's preventing them from being crawled.
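As a rough way to confirm the tag actually made it onto a page after the CMS change, something like the sketch below could be run against a page's HTML. It uses only Python's standard library; the `RobotsMetaParser` class, the `robots_directives` helper, and the sample page are hypothetical names made up for illustration, not part of any CMS or Moz tool.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page source, standing in for one page on a blocked subdomain.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
<title>Subdomain copy</title>
</head><body></body></html>"""

found = robots_directives(SAMPLE_PAGE)
print("noindex present:", "noindex" in found)
print("follow present:", "follow" in found)
```

This only inspects the meta tag in the HTML; it does not check an `X-Robots-Tag` HTTP header, which is another place the same directives can be set.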
-
Thanks. I'm still unsure whether we should go through the process of unblocking them. They are all showing in the SERPs with the "This URL is blocked by robots.txt" notice. Is it worrisome that such a large percentage of our URLs in the SERPs show as blocked by robots.txt, along with the "omitted from search results" message?
-
If Google has already crawled/indexed the subdomains, then adding noindex, follow is probably the best approach. If you only block the sites with robots.txt, Google still knows that the pages exist but can't crawl them, so the pages can take a long time to be de-indexed, if they ever are. Additionally, if those subdomains have any links, that link value is lost because Google can't crawl the pages.
Adding noindex, follow tells Google definitively to remove those subdomains from its index, and it helps preserve any link equity they've accumulated.
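The point about crawling can be illustrated with a small sketch: a noindex tag is only useful if robots.txt lets the crawler fetch the page in the first place. The example below uses Python's standard `urllib.robotparser` to evaluate two robots.txt bodies. The `can_crawl` helper, the rule strings, and the `staging.example.com` URL are assumptions for illustration, and Python's parser is only an approximation of how Googlebot interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks the whole subdomain (the current setup).
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical robots.txt after removing the block (empty Disallow allows everything).
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Hypothetical URL on one of the blocked subdomains.
PAGE_URL = "http://staging.example.com/some-page"

print("crawlable while blocked:", can_crawl(BLOCK_ALL, PAGE_URL))
print("crawlable after unblocking:", can_crawl(ALLOW_ALL, PAGE_URL))
```

While the first rule set is in place, a crawler that honors it never requests the page, so it never sees a noindex tag on it. That is why the block has to be lifted for the noindex to take effect.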