Googlebot on steroids... Why?
-
We launched a new website (www.gelderlandgroep.com). The site contains 500 pages, but some pages (like https://www.gelderlandgroep.com/collectie/) contains filters (so there are a lot possible url parameters). Last week we mentioned a tremendous amount of traffic (25 GB!!) and CPU usage on the server.
2017-12-04 16:11:57 W3SVC66 IIS14 83.219.93.171 GET /collectie model=6511,6901,7780,7830,2105-illusion&ontwerper=henk-vos,foklab 443 - 66.249.76.153 HTTP/1.1 Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gelderlandgroep.com 200 0 0 9445 501 312
We find out that "Googlebot" was firing many, many requests. At first we did a nslookup for the IPadres where it actually seems to be googlebot.
Second we visited Google Searchconsole and I was really surprised... Googlebot on steroids? Googlebot requested 922.565 different url's and made combinations for every filter/ parameter combination on the site. Why? The sitemap.xml contains 500 url's... The authority of the site isn't very high, no other signal that this is a special website... Why so much "Google resources"?
Of course we will exclude the parameters in SearchConsole, but I never saw a Googlebot activity for a small website like this before! Does anybody have any clue?
Regards Olaf
-
We got an answer from JohnMu - Webmaster Trends Analyst at Google. The reason of crawling is (as we find out) the filters which have infinite variations (one of developers was sleeping), we will correct this. Disallowing in Robot.txt is adviced as the quickest fix to stop the mega-crawling. This case will be used for further research because of the disproportionate capacity usage. You're right, Google initially will crawl everything, but they don't want Googlebot crawling looks like a "mini-Ddos-like attack".
-
Glad to help!
The large volume could well be to do with the way the filters are set up. There is also a possibility you could be sending some sort of authority signal somehow to Google, for instance if it is using the same Search Console as other valued brands or same WHOIS information.
My gut feeling is after the initial crawl the traffic will reduce, if it doesn't, it probably means Google is finding something new to index, may be dynamically created pages?
-
Thanks for your help!
I think you're probably right. The initial crawling must be complete if Google wants to put everything into the right perspective. But we manage en host more than 300 sites, including large A-brand sites. And even at those sites I had not seen this kind of volumes before.
The server logs also show the same amount of request this night (day five). I will keep you posted if this still continues after the weekend.
-
As far as I know, Google will attempt to find every single page it can possibly find regardless of authority. The frequency after the initial crawl will be affected by the site authority, volume and frequency of updates.
Virtually every page on every website that is publicly accessible will be index and rank somewhere, where you rank will be determined by Google ranking factors.
Keep in mind that search console stats will be a few days out of date (2 or 3 days) and it will normally take two or three days to crawl.
-
Mmm, is that correct? I thought that the amount of resources Google will put in crawling your (new) website also depends of it's authority. 9 million url's, for four days now... It seems to bee so much for this small website...
-
I would say your filters are creating pages in their own right, or at least as Google bot sees it. I have seen a similar thing happen on a site redesign. Potentially, if you can access each filter with a URL that could be listed as an individual page, assuming the content is different.
The first time Google crawls your site, it will try to find everything it possibly can to put it in the index, Google will eat data like no tomorrow
At this stage I wouldn't be too worried about it, just keep an eye out for duplicate content. I guess you'll see both graphs dipped down again to normal levels within a few days.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Suggested Screaming Frog configuration to mirror default Googlebot crawl?
Hi All, Does anyone have a suggested Screaming Frog (SF) configuration to mirror default Googlebot crawl? I want to test my site and see if it will return 429 "Too Many Requests" to Google. I have set the User Agent as Googlebot (Smartphone). Is the default SF Menu > Configuration > Speed > Max Threads 5 and Max URLs 2.0 comparable to Googlebot? Context:
Intermediate & Advanced SEO | | gravymatt-se
I had tried NetPeak SEO Spider which did a nice job and had a cool feature that would pause a crawl if it got to many 429. Long Story short, B2B site threw 429 Errors when there should have been no load on a holiday weekend at 1:00 AM.0 -
How do I know if I am correctly solving an uppercase url issue that may be affecting Googlebot?
We have a large e-commerce site (10k+ SKUs). https://www.flagandbanner.com. As I have begun analyzing how to improve it I have discovered that we have thousands of urls that have uppercase characters. For instance: https://www.flagandbanner.com/Products/patriotic-paper-lanterns-string-lights.asp. This is inconsistently applied throughout the site. I directed our website vendor to fix the issue and they placed 301 redirects via a rule to the web.config file. Any url that contains an uppercase character now displays as a lowercase. However, as I use screaming frog to monitor our site, I see all these 301 redirects--thousands of them. The XML sitemap still shows the the uppercase versions. We have had indexing issues as well. So I'm wondering what is the most effective way to make sure that I'm not placing an extra burden on Googlebot when they index our site? Should I have just not cared about the uppercase issue and let it alone?
Intermediate & Advanced SEO | | webrocket0 -
After Receiving a "Googlebot can't access your site" would this stop your site from being crawled?
Hi Everyone,
Intermediate & Advanced SEO | | AMA-DataSet
A few weeks ago now I received a "Googlebot can't access your site..... connection failure rate is 7.8%" message from the webmaster tools, I have since fixed the majority of these issues but iv noticed that all page except the main home page now have a page rank of N/A while the home page has a page rank of 5 still. Has this connectivity issues reduced the page ranks to N/A? or is it something else I'm missing? Thanks in advance.0 -
Best way to view Global Navigation bar from GoogleBot's perspective
Hi, Links in the global navigation bar of our website do not show up when we look at Google cache --> text only version of the page. These links use "style="<a class="attribute-value">display:none;</a>" when we looked at HTML source. But if I use "user agent switcher" add-on in Firefox and set it to Googlebot, the links in global nav are displayed. I am wondering what is the best way to find out if Google can/can not see the links. Thanks for the help! Supriya.
Intermediate & Advanced SEO | | SShiyekar0 -
It appears that Googlebot Mobile will look for mobile redirects from the desktop site, but still use the SEO from the desktop site.
Is the above statement correct? I've read that its better to have different SEO titles & descriptions for mobile sites as users search differently on mobile devices. I've also read it's good to link build, keep text content on mobile sites etc to get the mobile site to rank. If I choose to not have titles & descriptions on my mobile site will Google just rank our desktop version & then redirect a user on a mobile device to our mobile site or should I be adding in titles & descriptions into the mobile site? Thanks so much for any help!
Intermediate & Advanced SEO | | DCochrane0 -
Googlebot on paywall made with cookies and local storage
My question is about paywalls made with cookies and local storage. We are changing a website with free content to a open paywall with a 5 article view weekly limit. The paywall is made to work with cookies and local storage. The article views are stored to local storage but you have to have your cookies enabled so that you can read the free articles. If you don't have cookies enable we would pass an error page (otherwise the paywall would be easy to bypass). Can you say how this affects SEO? We would still like that Google would index all article pages that it does now. Would it be cloaking if we treated Googlebot differently so that when it does not have cookies enabled, it would still be able to index the page?
Intermediate & Advanced SEO | | OPU1 -
Googlebot Can't Access My Sites After I Repair My Robots File
Hello Mozzers, A colleague and I have been collectively managing about 12 brands for the past several months and we have recently received a number of messages in the sites' webmaster tools instructing us that 'Googlebot was not able to access our site due to some errors with our robots.txt file' My colleague and I, in turn, created new robots.txt files with the intention of preventing the spider from crawling our 'cgi-bin' directory as follows: User-agent: * Disallow: /cgi-bin/ After creating the robots and manually re-submitting it in Webmaster Tools (and receiving the green checkbox), I received the same message about Googlebot not being able to access the site, only difference being that this time it was for a different site that I manage. I repeated the process and everything, aesthetically looked correct, however, I continued receiving these messages for each of the other sites I manage on a daily-basis for roughly a 10-day period. Do any of you know why I may be receiving this error? is it not possible for me to block the Googlebot from crawling the 'cgi-bin'? Any and all advice/insight is very much welcome, I hope I'm being descriptive enough!
Intermediate & Advanced SEO | | NiallSmith1 -
We would like to know where googlebots search from - IP location?
We have a client who has two sites for different countries, 1 US, 1 UK and redirects visitors based on IP. In order to make sure that the English site is crawable, we need to know where the googlebot searches from. Is this a US IP or a UK IP for a UK site / server?
Intermediate & Advanced SEO | | AxonnMedia0