Google has deindexed 40% of my site because it's having problems crawling it
-
Hi
Last week i got my fifth email saying 'Google can't access your site'. The first one i got in early November. Since then my site has gone from almost 80k pages indexed to less than 45k pages and the number is lowering even though we post daily about 100 new articles (it's a online newspaper).
The site i'm talking about is http://www.gazetaexpress.com/
We have to deal with DDoS attacks most of the time, so our server guy has implemented a firewall to protect the site from these attacks. We suspect that it's the firewall that is blocking google bots to crawl and index our site. But then things get more interesting, some parts of the site are being crawled regularly and some others not at all. If the firewall was to stop google bots from crawling the site, why some parts of the site are being crawled with no problems and others aren't?
In the screenshot attached to this post you will see how Google Webmasters is reporting these errors.
In this link, it says that if 'Error' status happens again you should contact Google Webmaster support because something is preventing Google to fetch the site. I used the Feedback form in Google Webmasters to report this error about two months ago but haven't heard from them. Did i use the wrong form to contact them, if yes how can i reach them and tell about my problem?
If you need more details feel free to ask. I will appreciate any help.
Thank you in advance
-
Great news - strange that these 608 errors didn't appear while crawling the site with Screaming Frog.
-
We found the problem. It was about website compression (GZIP). I found this after crawling my site with Moz, and saw lot's of pages with 608 Error code. Then i searched in Google and saw a response by Dr. Pete in another question here in Moz Q/A (http://moz.com/community/q/how-do-i-fix-608-s-please)
After we removed the GZIP, Google could crawl the site with no problems.
-
Dirk
Thanks a lot for your help. Unfortunately the problem remains the same. More than 65% of site has been de-indexed and it's making our work very difficult.
I'm hoping that somebody here might have any idea of what is causing this so we can find a solution to fix it.
Thank you all for your time.
-
Hi
Not sure if the indexing problem is solved now, but I did a few other checks. Most of the tools I used where able to capture the problem url without much issues even from California ip's & simulating Google bot.
I noticed that some of the pages (example http://www.gazetaexpress.com/fun/) are quite empty if you browse them without Javascript active. Navigating through the site with Javascript is extremely slow, and a lot of links don't seem to respond. When trying to go from /fun/ to /sport/ without Javascript - I got a 504 Gateway Time-out
Normally Google is now capable of indexing content by executing the javascript, but it's always better to have a non-javascript fallback that can always be indexed (http://googlewebmastercentral.blogspot.be/2014/05/understanding-web-pages-better.html) - the article states explicitly
- If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources.
This could be the reason for the strange errors when trying to fetch like Google.
Hope this helps,
Dirk
-
Hi Dirk
Thanks a lot for your reply.
Today we turned off the firewall for a couple hours and tried to fetch the site as Google. It didn't work. The results we're the same as before.
This problem is starting to be pretty ugly since Google has started now not showing our mobile results as 'mobile-friendly' even though we have a mobile version of site, we are using rel=canonical and rel=alternate and 302 redirects for mobile users from desktop pages to mobile ones when they are browsing via smartphone.
Any other idea what might be causing this?
Thanks in advance
-
Hi,
It seems that you're pages are extremely heavy to load - I did 2 tests - on your homepage & on the /moti-sot page
Your homepage needed a whopping 73sec to load (http://www.webpagetest.org/result/150312_YV_H5K/1/details/) - the moti-sot page is quicker - but 8sec is still rather high (http://www.webpagetest.org/result/150312_SK_H9M/)
I sometimes noticed a crash of the Shockwave flash plugin, but not sure if this is related to your problem;I crawled your site with Screaming Frog, but it didn't really find any indexing problems - while you have a lot of pages very deep in your sitestructure, the bot didn't seem to have any specific troubles to access your page. Websniffer returns a normal 200 code when checking your sites - even with useragent "Google"
So I guess you're right about the firewall - may be it's blocking the ip addresses used by Google bot - do you have reporting from the firewall which traffic is blocked? Try to search for the useragent Googlebot in your logfiles and see if this traffic is rejected. The fact that some sections are indexed and others not could be related to the configuration of the firewall, and/or the ip addresses used by Google bot to check your site (the bot is not always using the same ip address)
Hope this helps,
Dirk
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Some bots excluded from crawling client's domain
Hi all! My client is in healthcare in the US and for HIPAA reasons, blocks traffic from most international sources. a. I don't think this is good for SEO b. The site won't allow Moz bot or Screaming Frog bot to crawl it. It's so frustrating. We can't figure out what mechanism they are utilizing to execute this. Any help as we start down the rabbit hole to remedy is much appreciated. thank you!
Technical SEO | | SimpleSearch0 -
Site Migration between CMS's
Hi There, I have a technical question about migrating CMS's but not servers. My client has site A on Joomla install, He want's ot migrate to Wordpress and we will call this site B. As he has a lot of old content on site A he doesn't want to lose, he has put site B (wordpress install) on a subdirectory site.com/siteb (for example). and will use a htaccess to forward the root domain to this wordpress site. Therefore anyone going to www.site.com will see the new wordpress site and the old content and joomla install will sit on the root of the server. Will Google have an issue with this? Will it even find the old content? what are the issues for the new site and new content? Look forward getting your guys input
Technical SEO | | nezona1 -
What's wrong with this robots.txt
Hi. really struggling with the robots.txt file
Technical SEO | | Leonie-Kramer
this is it: User-agent: *
Disallow: /product/ #old sitemap
Disallow: /media/name.xml When testing in w3c.org everything looks good, testing is okay, but when uploading it to the server, Google webmaster tools gives 3 errors. Checked it with my collegue we both don't know what's wrong. Can someone take a look at this and give me the solution.
Thanx in advance! Leonie1 -
Need advice for new site's structure
Hi everyone, I need to update the structure of my site www.chedonna.it Basicly I've two main problems: 1. I've 61.000 index tag (more with no post)2. The category of my site are noindex I thought to fix my problem making the category index and the tag noindex, but I'm not sure if this is the best solution because I've a great number of tag idexed by Google for a long time. Mybe it is correct just to make the category index and linking it from the post and leave the tag index. Could you please let me know what's your opinion? Regards.
Technical SEO | | salvyy0 -
Structuring URL's for better SEO
Hello, We were rolling our fresh urls for our new service website. Currently we have our structure as www.practo.com/health/dental/clinic/bangalore We like to have it as www.practo.com/health/dental-clinic-bangalore Can someone advice us better which one of the above structure would work out better and why? Should this be a focus of attention while going ahead since this is like a search engine platform for patients looking out for actual doctors. Thanks, Aditya
Technical SEO | | shanky10 -
A site is not being indexed by Google Yahoo or Bing
This site - http://adoptionconnection.org/ is not being indexed by any of the search engines. I checked the easy stuff - robots text is: <meta name="<a class="attribute-value">robots</a>" content="<a class="attribute-value">all, index, follow</a>" /> <meta name="<a class="attribute-value">robots</a>" content="<a class="attribute-value">noodp</a>" /> <meta name="<a class="attribute-value">robots</a>" content="<a class="attribute-value">noydir</a>" /> I have checked what I can determine would cause the issue but have found nothing to prevent it from being indexed. I'm thinking it may be re-directs etc. Any answer would be great. Thanks in advance,
Technical SEO | | Intergen0 -
Redirect Flash Site for Google Only - Is this against TOS?
A photographer client has a flash website, purchased as from a (well respected) template company. The main site is at the root domain, and the HTML version is at www.example.com/?load=html If I visit the site on a browser without Flash installed, I am re-directed automatically to the HTML version. I'm concerned as the site has some great links and the HTML version is well optimised, but doesn't appear anywhere in Google for chosen keywords (ranks perfectly for brand related searches). Google is indexing the Flash version of the site, but I would rather it didn't (there's no real content (just Javascript to load the SWF) and all of the pages load under one URL). How can I block the Flash version from Google but still make the incoming links count towards the HTMl version of the site? If I re-direct Google to the HTML version, is this cloaking, and is it frowned upon? Thanks for any advice you can offer.
Technical SEO | | cmaddison0 -
Google has not indexed my site in over 4 weeks, what's the problem?
We recently put in permanent redirects to our new url, but Google seems to not want to index the new url. There was no problems with the old url and the new url is brand new so should have no 'black marks' against it. We have done everything we can think off in terms of submitting site maps, telling google our url has changed in webmaster tools, mentioning the new url on social sites etc...but still nothing. It has been over 4 weeks now since we set up the redirects to the url, any ideas why Google seems to be choosing not to index it? Thanks
Technical SEO | | cewe0