Disallow statement - is this tiny anomaly enough to render Disallow invalid?

lzhao

Google site search (site:'hbn.hoovers.com') indicates 171,000 results for this subdomain. That is not a desired result - this site has 100% duplicate content. We don't want SEs spending any time here.

Robots.txt is set up mostly right to disallow all search engines from indexing this site. That asterisk at the end of the disallow statement looks pretty harmless - but could that be why the site has been indexed?

User-agent: *
Disallow: /*

lzhao

Interesting. I'd never heard that before.

We've never had GA or GWT on these mirror sites before, so it's hard to say what Google is doing these days.

But the goal is definitely to make them and their contents invisible to SEs. We'll get GWT on there and start removing URLs.

Thanks!

WilliamKammer

The additional asterisk shouldn't do you any harm, although standard practice seems to be just putting the "/".

Does it seem like Google is still crawling this subdomain when you look at webmasters crawl stats? While the disallow function in robots.txt will usually stop bots from crawling, it doesn't prevent them from indexing or keeping pages indexed that were before the disallow was put in place. If you want these pages removed from the index, you can request it through webmasters and also use meta robots noindex as opposed to the robots.txt file. Moz has a good article about it here: http://moz.com/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts

If you're just worried about bots crawling the subdomain, it's possible they've already stopped crawling it, but continue to index it due to history or additional indicators suggesting they should index it.

Welcome to the Q&A Forum

Browse the forum for helpful insights and fresh discussions about all things SEO.

Disallow statement - is this tiny anomaly enough to render Disallow invalid?

Got a burning SEO question?

Browse Questions

Explore more categories

Related Questions

Fetch and Render misses middle chunk of page

Google Search Console Site Map Anomalies (HTTP vs HTTPS)

Trying to ensure AJAX rendered categories remain SEO friendly.

Robots.txt anomaly

Robots.txt to disallow /index.php/ path

Googlebot does not obey robots.txt disallow

Disallowing https URLs

Should I set up a disallow in the robots.txt for catalog search results?