Disallow statement - is this tiny anomaly enough to render Disallow invalid?
-
Google site search (site:'hbn.hoovers.com') indicates 171,000 results for this subdomain. That is not a desired result - this site has 100% duplicate content. We don't want SEs spending any time here.
Robots.txt is set up mostly right to disallow all search engines from indexing this site. That asterisk at the end of the disallow statement looks pretty harmless - but could that be why the site has been indexed?
User-agent: * Disallow: /*
-
Interesting. I'd never heard that before.
We've never had GA or GWT on these mirror sites before, so it's hard to say what Google is doing these days.
But the goal is definitely to make them and their contents invisible to SEs. We'll get GWT on there and start removing URLs.
Thanks!
-
The additional asterisk shouldn't do you any harm, although standard practice seems to be just putting the "/".
Does it seem like Google is still crawling this subdomain when you look at webmasters crawl stats? While the disallow function in robots.txt will usually stop bots from crawling, it doesn't prevent them from indexing or keeping pages indexed that were before the disallow was put in place. If you want these pages removed from the index, you can request it through webmasters and also use meta robots noindex as opposed to the robots.txt file. Moz has a good article about it here: http://moz.com/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts
If you're just worried about bots crawling the subdomain, it's possible they've already stopped crawling it, but continue to index it due to history or additional indicators suggesting they should index it.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Disallowing URL Parameters vs. Canonicalizing
Hi all, I have a client that has a unique search setup. So they have Region pages (/state/city). We want these indexed and are using self-referential canonicals. They also have a search function that emulates the look of the Region pages. When you search for, say, Los Angeles, the URL changes to _/search/los+angeles _and looks exactly like /ca/los-angeles. These search URLs can also have parameters (/search/los+angeles?age=over-2&time[]=part-time), which we obviously don't want indexed. Right now my concern is how best to ensure the /search pages don't get indexed and we don't get hit with duplicate content penalties. The options are this: Self-referential canonicals for the Region pages, and disallow everything after the second slash in /search/ (so the main search page is indexed) Self-referential canonicals for the Region pages, and write a rule that automatically canonicalizes all other search pages to /search. Potential Concern: /search/ URLs are created even with misspellings. Thanks!
Technical SEO | | Alces1 -
Robots.txt Disallow: / in Search Console
Two days ago I found out through search console that my website's Robots.txt has changed to User-agent: *
Technical SEO | | RAN_SEO
Disallow: / When I check the robots.txt in the website it looks fine - I see its blocked just in search console( in the robots.txt tester). when I try to do fetch as google to the homepage I see its blocked. Any ideas why would robots.txt block my website? it was fine until the weekend. before that, in the last 3 months I saw I had blocked resources in the website and I brought back pages with fetch as google. Any ideas?0 -
Webmaster Tools Data Discrepancies/Anomalies
Hi Im looking in GWT account for a client of mine and see that in the index status area 400+ pages are indexed so all seems ok there! But then in the sitemap area 111 pages have been submitted but only 1 indexed ! Any ideas whats going on here ? Cheers Dan
Technical SEO | | Dan-Lawrence0 -
Is Noindex Enough To Solve My Duplicate Content Issue?
Hello SEO Gurus! I have a client who runs 7 web properties. 6 of them are satellite websites, and 7th is his company's main website. For a long while, my company has, among other things, blogged on a hosted blog at www.hismainwebsite.com/blog, and when we were optimizing for one of the other satellite websites, we would simply link to it in the article. Now, however, the client has gone ahead and set up separate blogs on every one of the satellite websites as well, and he has a nifty plug-in set up on the main website's blog that pipes in articles that we write to their corresponding satellite blog as well. My concern is duplicate content. In a sense, this is like autoblogging -- the only thing that doesn't make it heinous is that the client is autoblogging himself. He thinks that it will be a great feature for giving users to his satellite websites some great fresh content to read -- which I agree, as I think the combination of publishing and e-commerce is a thing of the future -- but I really want to avoid the duplicate content issue and a possible SEO/SERP hit. I am thinking that a noindexing of each of the satellite websites' blog pages might suffice. But I'd like to hear from all of you if you think that even this may not be a foolproof solution. Thanks in advance! Kind Regards, Mike
Technical SEO | | RCNOnlineMarketing0 -
Will invalid HTML code generated by WordPress affect SEO efforts?
Hi all, I'm new to SEOmoz and SEO in general really. I run a small but well regarded freelance website and graphic design business, and until very recently had an employee who handled the SEO side of things. I'm now looking to step into this role myself and hopefully learn the in's and out's of SEO. I've no doubt there will be much to learn, but the SEOmoz tools and it's community seem excellent and helpful. My question then is basically, if WordPress generated HTML code can have an effect on SEO, when it's reported as invalid by tools such as the W3C HTML validator? I'm used to hand coding the majority of my websites for clients, where creating valid HTML and CSS code is something I can do with relative ease. A new client however wants to use WordPress - for ease of updating the site content themselves. The client does however consider any potential SEO implications to be a very important factor in choosing a hand coded vs. WordPress based website. I am aware that WordPress itself is just a means of generating HTML code, and that to the search engines there is no difference between this and the hand coded websites I usually produce. However if WordPress is generating HTML that is being reported as invalid, would this make the search engines penalise the site? On a second note, will the search engines look negatively on a WordPress site where it is being used as a standard website, and the content may not be updated as frequently, as say, a blog? Thanks for your time, and I look forward to hearing your suggestions.
Technical SEO | | SavilleWolf0 -
Should search pages be disallowed in robots.txt?
The SEOmoz crawler picks up "search" pages on a site as having duplicate page titles, which of course they do. Does that mean I should put a "Disallow: /search" tag in my robots.txt? When I put the URL's into Google, they aren't coming up in any SERPS, so I would assume everything's ok. I try to abide by the SEOmoz crawl errors as much as possible, that's why I'm asking. Any thoughts would be helpful. Thanks!
Technical SEO | | MichaelWeisbaum0 -
Can I Disallow Faceted Nav URLs - Robots.txt
I have been disallowing /*? So I know that works without affecting crawling. I am wondering if I can disallow the faceted nav urls. So disallow: /category.html/? /category2.html/? /category3.html/*? To prevent the price faceted url from being cached: /category.html?price=1%2C1000
Technical SEO | | tylerfraser
and
/category.html?price=1%2C1000&product_material=88 Thanks!0 -
Link Juice passing through a redirect of a disallowed URL
Hey guys! Suppose I disallow search bots from indexing anything on my secure server in my robots.txt, and 301 redirect all of my secure server traffic to my non-secure site. Will the search bots see the redirect before they realize that they're disallowed from accessing that page? Or will they see that page is disallowed and not follow the redirect? Should I change my robots.txt to allow search bots to crawl my secure site so they can find the redirects?
Technical SEO | | john4math0