Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Block Domain in robots.txt
-
Hi.
We had some URLs that were indexed in Google from a www1-subdomain. We have now disabled the URLs (returning a 404 - for other reasons we cannot do a redirect from www1 to www) and blocked via robots.txt. But the amount of indexed pages keeps increasing (for 2 weeks now). Unfortunately, I cannot install Webmaster Tools for this subdomain to tell Google to back off...
Any ideas why this could be and whether it's normal?
I can send you more domain infos by personal message if you want to have a look at it.
-
Hi Philipp,
I have not heard of Google going rogue like this before, however I have seen it with other search engines (Baidu).
I would first verify that the robots.txt is configured correctly, and verify there is no links anywhere to the domain. The reason I mentioned this prior, was due to this official notification on Google: https://support.google.com/webmasters/answer/156449?rd=1
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
My next thought would be, did Google start crawling the site before the robots.txt blocked them from doing so? This may have caused Google to start the indexing process which is not instantaneous, then you have the new urls appear after the robots.txt went into effect. The solution is add the meta tag noindex, or block put an explicit block on the server as I mention above.
If you are worried about duplicate content issues you maybe able to at least canonical the subdomain urls to the correct url.
Hope that helps and good luck
-
Hi Don
Thanks for your hint. It doesn't look like there are any links to the www1 subdomain. Also, since we've let the www1-Subdomain return 404's and blocked it with robots, the indexed pages increased from 39'300 to 45'100 so this is more than anybody would link to... Really strange why Google just ignores robots and keeps indexing...
-
Hi Phil,
Is it possible that google is find the links on another site (like somebody else has your links on their site)? Depending on your situation a good catch all block is to secure the www1 domain with (.htaccess/**.**htpasswd ) this would force anybody (even bots) to provide credentials to see or explore the site. Of course everybody who needs access to the site would have the credentials. So in theory you shouldn't see any more urls getting indexed.
Hope that helps,Don
-
Thanks for the resource Chris! The strange thing is that Google keeps indexing new URLs even though it is clearly blocked via robots.txt...
But I guess I'll just wait for these 90 days to pass then...
-
Phillip,
If you've deleted the URLs, there's not much else for you to do. You're experiencing the lag between when Google crawls and indexes pages new pages and when it finds and removes a 404 URL from it's index.
You should think 90 days as an approximate time frame for your page count in the index to start dropping. Here's more from google:
https://support.google.com/webmasters/answer/1663419
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Crawl solutions for landing pages that don't contain a robots.txt file?
My site (www.nomader.com) is currently built on Instapage, which does not offer the ability to add a robots.txt file. I plan to migrate to a Shopify site in the coming months, but for now the Instapage site is my primary website. In the interim, would you suggest that I manually request a Google crawl through the search console tool? If so, how often? Any other suggestions for countering this Meta Noindex issue?
Technical SEO | | Nomader1 -
How can you promote a sub-domain ahead of a domain on the SERPs?
I have a new client that wants to promote their subdomain uk.imagemcs.com and have their main domain imagemcs.com fall off the SERPs. Objective? Get uk.imagemcs.com to rank first for UK 'brand' searches. Do a search for 'imagem creative services' and you should see the issue (it looks like rules have been applied to the robots.txt on the main domain to exclude any bots from crawling - but since they've been indexed previously I need to take action as it doesn't look great!). I think I can do this by applying a permanent redirect from the main domain to the subdomain at domain level and then no-indexing the site - and then resubmit the sitemap. My slight concern is that this no-indexing of the main domain may impact on the visibility of the subdomains (I'm dealing with uk.imagemcs.com, but there is us.imagemcs.com and de.imagemcs.com) and was looking for some assurance that this would not be the case. My understanding is that subdomains are completely distinct from domains and as such this action should have no impact on the subdomains. I asked the question on the Webmasters Forum but haven't really got anywhere
Technical SEO | | nathangdavidson2
https://productforums.google.com/forum/#!msg/webmasters/1Avupy3Uw_o/hu6oLQntCAAJ Can anyone suggest a course of action? many thanks, Nathan0 -
Robots txt. in page with 301 redirect
We currently have a a series of help pages that we would like to disallow from our robots txt. The thing is that these help pages are located in our old website, which now has a 301 redirect to current site. Which is the proper way to go around? 1- Add the pages we want to disallow to the robots.txt of the new website? 2- Break the redirect momentarily and add the pages to the robots.txt of the old one? Thanks
Technical SEO | | Kilgray0 -
Oh no googlebot can not access my robots.txt file
I just receive a n error message from google webmaster Wonder it was something to do with Yoast plugin. Could somebody help me with troubleshooting this? Here's original message Over the last 24 hours, Googlebot encountered 189 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%. Recommended action If the site error rate is 100%: Using a web browser, attempt to access http://www.soobumimphotography.com//robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to googlebot. If your robots.txt is a static page, verify that your web service has proper permissions to access the file. If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure. If the site error rate is less than 100%: Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors. The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website. After you think you've fixed the problem, use Fetch as Google to fetch http://www.soobumimphotography.com//robots.txt to verify that Googlebot can properly access your site.
Technical SEO | | BistosAmerica0 -
Allow or Disallow First in Robots.txt
If I want to override a Disallow directive in robots.txt with an Allow command, do I have the Allow command before or after the Disallow command? example: Allow: /models/ford///page* Disallow: /models////page
Technical SEO | | irvingw0 -
Any way around buying hosting for an old domain to 301 redirect to a new domain?
Howdy. I have just read this QA thread, so I think I have my answer. But I'm going to ask anyway! Basically DomainA.com is being retired, and DomainB.com is going to be launched. We're going to have to redirect numerous URLs from DomainA.com to DomainB.com. I think the way to go about this is to continue paying for hosting for DomainA.com, serving a .htaccess from that hosting account, and then hosting DomainB.com separately. Anybody know of a way to avoid paying for hosting a .htaccess file on DomainA.com? Thanks!
Technical SEO | | SamTurri0 -
Subdomain and Domain Rankings
I have read here that domain names with keywords might add a boost to your search rank For instance using a completely inane example monkey-fights.com might get a boost compared to mfl.com (monkey fighting league) when searching for "monkey fights" There seems to be a hot debate as to how much bonus the first domain might get over the second, but leaving that aside for the moment. Question 1. Would monkey-fights.mfl.com get the same kind of bonus as a root domain bonus? Question 2. If the answer to 1 above was yes would a 301 redirect from the suddomain URL to root domain URL retain that bonus I was just thinking on how hard it is to get root domains these days that are not either being squatted on etc. and if this might be a way to get the same bonus, or maybe subdomains are less bonus prone and so it would be a waste of time Thanks
Technical SEO | | bThere0 -
Should I set up a disallow in the robots.txt for catalog search results?
When the crawl diagnostics came back for my site its showing around 3,000 pages of duplicate content. Almost all of them are of the catalog search results page. I also did a site search on Google and they have most of the results pages in their index too. I think I should just disallow the bots in the /catalogsearch/ sub folder, but I'm not sure if this will have any negative effect?
Technical SEO | | JordanJudson0