Accidentally blocked Googlebot for 14 days
-
Today, after I noticed a huge drop in organic traffic to the inner pages of my sites, I looked into the code and realized that a bug in the last commit caused the server to show captcha pages for all Googlebot requests since Apr 24.
My site has more than 4,000,000 pages in the index. Before the last code change, Googlebot was exempt from being shown the captcha, so every inner page was crawled and indexed perfectly with no problems.
The bug broke the whitelisting mechanism and treated requests from Google's IP addresses the same as those from regular users. As a result, the captcha page was crawled whenever Googlebot visited thousands of my site's inner pages, which made Google think all my inner pages are identical to each other. Google removed all the inner pages from the SERPs starting May 5th; before that, many of those inner pages had good rankings.
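For reference, the whitelisting is supposed to work roughly like this (a simplified Python sketch, not our actual code; the helper names and the captcha logic are made up):

```python
import socket

def is_verified_googlebot(ip: str, user_agent: str) -> bool:
    """True only for requests that really come from Googlebot.

    Google's documented advice: reverse-DNS the client IP, check that the
    hostname ends in googlebot.com or google.com, then forward-resolve the
    hostname and confirm it maps back to the same IP.
    """
    if "Googlebot" not in user_agent:
        return False
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return socket.gethostbyname(host) == ip           # forward confirmation
    except OSError:
        return False


def choose_response(ip: str, user_agent: str, looks_suspicious: bool) -> str:
    """Decide what to serve. The bug effectively skipped the whitelist check,
    so Googlebot fell into the captcha branch like any anonymous visitor."""
    if is_verified_googlebot(ip, user_agent):
        return "real page"          # verified crawlers never see the captcha
    return "captcha page" if looks_suspicious else "real page"


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(choose_response("66.249.66.1", ua, looks_suspicious=True))
```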
I initially thought this was a manual or algorithmic penalty, but:
1. I did not receive a warning message in GWT
2. The ranking for the main URL is still good. I tried "Fetch as Google" in GWT and realized that all Googlebot has seen on my inner pages for the past 14 days is the same captcha page.
Now, I have fixed the bug and updated the production site. I just wanted to ask:
1. How long will it take for Google to remove the "duplicate content" flag from my inner pages and show them in the SERPs again? In my experience Googlebot revisits URLs quite often, but is it correct that once a URL is flagged as "contains similar content" it can be difficult to recover?
2. Besides waiting for Google to update its index, what else can I do right now?
Thanks in advance for your answers.
-
Thanks for the info. My site's current crawl rate is about 350,000 pages per day, so it will take 10-20 days to crawl the entire site.
Most of the organic traffic goes to about 10,000 URLs; the rest are pagination URLs, etc. Right now the first inner page of each term, which received all of that traffic, has disappeared from the results of the inurl: command.
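As a rough sanity check on that estimate, just the arithmetic from the numbers above:

```python
total_pages = 4_000_000   # pages in Google's index
crawl_rate = 350_000      # pages crawled per day, per GWT crawl stats

days = total_pages / crawl_rate
print(f"~{days:.0f} days for a full recrawl at the current rate")
```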
-
One of my competitors made this type of error, and we figured it out right away when their site dropped from the SERPs. It took them a couple of weeks to figure it out and make the change. We were hoping that they never figured it out so we could rake in lots of dough. When they fixed it they were back in the SERPs at full strength within a couple of days... but they had 40 indexed pages instead of 4,000,000.
I think you will recover well, but it might take a while if you don't have a lot of deep links.
Good luck.
-
Pretty much all you can do is wait for Google to recrawl your entire site. You can try re-submitting your site in Webmaster Tools (Health -> Fetch as Google). Getting links from other sites will help speed up the crawling, and links from social sites like Twitter and Google+ can help with crawling as well.
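In the meantime it is worth double-checking that the fix actually reaches crawlers. A quick smoke test along these lines can fetch a few inner pages with a Googlebot user-agent string and flag any that still return the captcha (a sketch only; the URLs and the captcha marker are placeholders, and if your whitelist verifies Googlebot by reverse DNS rather than user-agent, "Fetch as Google" in GWT remains the authoritative check):

```python
import urllib.request

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

# Placeholders -- substitute your own inner pages and a string that
# only ever appears on the captcha page.
URLS = [
    "http://www.example.com/term/widget-1",
    "http://www.example.com/term/widget-2",
]
CAPTCHA_MARKER = "please complete the captcha"

for url in URLS:
    req = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    status = "STILL CAPTCHA" if CAPTCHA_MARKER.lower() in body.lower() else "ok"
    print(f"{status:13} {url}")
```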
Related Questions
-
Pages Crawled Per Day Gone Drastically Down, Is It a Google Issue?
Hello Experts, in Search Console, under Crawl Stats, the pages crawled per day have been dropping day by day, i.e. from 4 lakh (400,000) pages per day down to about 2 lakh (200,000) over the last 15 days. So where is the issue? Am I going wrong somewhere, or is it an issue on Google's end? Thanks!
Technical SEO | Johny123450
HTTP status showing up in Open Site Explorer top pages as blocked by robots.txt file
I am trying to find an answer to this question. There are a lot of URLs on this page with no data, and when I go into the data source and search for noindex or robots.txt I don't see anything, yet the site is visible in the search engines?
Technical SEO | ReSEOlve0
Oh no, Googlebot cannot access my robots.txt file
I just received an error message from Google Webmaster Tools. I wonder whether it has something to do with the Yoast plugin. Could somebody help me with troubleshooting this? Here's the original message:
Over the last 24 hours, Googlebot encountered 189 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%.
Recommended action
If the site error rate is 100%:
Using a web browser, attempt to access http://www.soobumimphotography.com//robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to googlebot.
If your robots.txt is a static page, verify that your web service has proper permissions to access the file.
If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure.
If the site error rate is less than 100%:
Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors.
The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website.
After you think you've fixed the problem, use Fetch as Google to fetch http://www.soobumimphotography.com//robots.txt to verify that Googlebot can properly access your site.
Technical SEO | BistosAmerica0
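For a case like this, a small external check can confirm whether robots.txt is reachable at all and whether only crawler user-agents are being refused (a sketch; the domain is taken from the error message above, everything else is generic):

```python
import urllib.error
import urllib.request

URL = "http://www.soobumimphotography.com/robots.txt"   # domain from the error message above
USER_AGENTS = {
    "browser":   "Mozilla/5.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for name, ua in USER_AGENTS.items():
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{name:10} HTTP {resp.status}, {len(resp.read())} bytes")
    except urllib.error.HTTPError as e:
        # A 403/503 only for the crawler UA would point at a firewall or security-plugin rule.
        print(f"{name:10} HTTP {e.code}")
    except urllib.error.URLError as e:
        print(f"{name:10} request failed: {e.reason}")
```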
Temporarily suspend Googlebot without blocking users
We'll soon be launching a redesign, on a new platform, migrating millions of pages to new URLs. How can I tell Google (and other crawlers) to temporarily (a day or two) ignore my site? We're hoping to buy ourselves a small bit of time to verify redirects and live functionality before allowing Google to crawl and index the new architecture. GWT's recommendation is to 503 all pages - including robots.txt, but that also makes the site invisible to real site visitors, resulting in significant business loss. Bad answer. I've heard some recommendations to disallow all user agents in robots.txt. Any answer that puts the millions of pages we already have indexed at risk is also a bad answer. Thanks
Technical SEO | lzhao0
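One common way to get the effect asked about above is to return 503 (with a Retry-After hint) only to crawler user-agents while serving everyone else normally. A rough sketch, assuming a Flask app and a deliberately naive user-agent check (a real cut-over should verify crawler IPs and be tested carefully):

```python
from flask import Flask, Response, request

app = Flask(__name__)

CRAWLER_TOKENS = ("Googlebot", "bingbot", "Slurp")   # naive UA match, for brevity
MAINTENANCE = True                                   # flip to False once redirects are verified

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def serve(path):
    ua = request.headers.get("User-Agent", "")
    if MAINTENANCE and any(token in ua for token in CRAWLER_TOKENS):
        # Crawlers get "come back later" without the URLs being dropped;
        # real visitors never reach this branch.
        return Response("Temporarily unavailable to crawlers",
                        status=503,
                        headers={"Retry-After": "172800"})   # about two days
    return f"normal page for /{path}"
```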
My site has dropped three places in one day and I do not understand it
Hi, our site has always been either number one in Google or, at the bad times, number two for the search term gastric band hypnotherapy, but today we have dropped three places and we do not understand what has happened. Can anyone please have a look and see what we need to do to get back on top? The site that is above us does not have much information on the subject, so we are very puzzled. Here is our site: www.clairehegarty.co.uk/virtual-gastric-band-with-hypnotherapy. Many thanks
Technical SEO | ClaireH-1848860
Blocking AJAX Content from being crawled
Our website has some pages with content shared from a third-party provider, and we use AJAX as our implementation. We don't want Google to crawl the third party's content, but we do want them to crawl and index the rest of the web page. However, in light of Google's recent announcement about indexing AJAX content more effectively, I have some concern that we are at risk of that content being indexed. I have thought about X-Robots-Tag but am concerned about implementing it on the pages because of the potential risk of Google not indexing the whole page. These pages get significant traffic for the website, and I can't risk that. Thanks, Phil
Technical SEO | AU-SEO0
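On the X-Robots-Tag idea mentioned above: the header only needs to go on the response of the endpoint that serves the third-party fragment, not on the host page, so the page itself stays indexable. A hypothetical sketch (the Flask app, endpoint path, and payload are all made up):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/partner-content/<item_id>")
def partner_content(item_id):
    # Only this AJAX fragment carries the noindex header; the page that
    # embeds it is served elsewhere without X-Robots-Tag, so it can still
    # be crawled and indexed normally.
    resp = jsonify({"id": item_id, "html": "<div>third-party content</div>"})
    resp.headers["X-Robots-Tag"] = "noindex, nofollow"
    return resp
```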
Is blocking RSS Feeds with robots.txt necessary?
Is it necessary to block an RSS feed with robots.txt? It seems feeds are automatically not indexed (http://googlewebmastercentral.blogspot.com/2007/12/taking-feeds-out-of-our-web-search.html), and Google says here that it's important not to block RSS feeds (http://googlewebmastercentral.blogspot.com/2009/10/using-rssatom-feeds-to-discover-new.html). I'm just checking!
Technical SEO | nicole.healthline0