Googlebot on steroids... Why?
-
We launched a new website (www.gelderlandgroep.com). The site contains 500 pages, but some pages (like https://www.gelderlandgroep.com/collectie/) contain filters, so there are a lot of possible URL parameters. Last week we noticed a tremendous amount of traffic (25 GB!!) and CPU usage on the server.
2017-12-04 16:11:57 W3SVC66 IIS14 83.219.93.171 GET /collectie model=6511,6901,7780,7830,2105-illusion&ontwerper=henk-vos,foklab 443 - 66.249.76.153 HTTP/1.1 Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gelderlandgroep.com 200 0 0 9445 501 312
We found out that "Googlebot" was firing many, many requests. First we did an nslookup on the IP address, which confirmed that it actually is Googlebot.
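(For anyone who wants to automate that check: below is a rough Python sketch of the reverse-then-forward DNS verification that nslookup does by hand. The IP is the client IP from the log line above; the function name is just for illustration.)

    import socket

    def is_real_googlebot(ip: str) -> bool:
        """Verify a claimed Googlebot IP: reverse DNS, then forward DNS back to the same IP."""
        try:
            # Reverse lookup, e.g. 66.249.76.153 -> crawl-66-249-76-153.googlebot.com
            host, _, _ = socket.gethostbyaddr(ip)
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            # Forward lookup: the hostname must resolve back to the original IP
            _, _, addresses = socket.gethostbyname_ex(host)
            return ip in addresses
        except OSError:
            # No reverse DNS record (or lookup failure): treat as not Googlebot
            return False

    print(is_real_googlebot("66.249.76.153"))  # the client IP from the log line above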
Then we checked Google Search Console, and I was really surprised... Googlebot on steroids? Googlebot had requested 922,565 different URLs, trying every filter/parameter combination on the site. Why? The sitemap.xml contains 500 URLs... The site's authority isn't very high, and there is no other signal that this is a special website... Why spend so many Google resources on it?
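(To put that number in perspective, here is a quick back-of-the-envelope calculation with made-up value counts, since I don't have the real catalogue sizes at hand: even two multi-select filters multiply into far more crawlable URLs than the 500 pages in the sitemap.)

    # Hypothetical value counts per multi-select filter -- the real catalogue numbers
    # differ, but every subset of each filter's values produces its own distinct URL.
    filter_values = {"model": 20, "ontwerper": 10}

    total_urls = 1
    for count in filter_values.values():
        total_urls *= 2 ** count  # each value can be either present or absent in the URL

    print(f"{total_urls:,} possible filter URLs")  # 1,073,741,824 with these made-up counts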
Of course we will exclude the parameters in Search Console, but I have never seen Googlebot activity like this on such a small website before! Does anybody have a clue?
Regards, Olaf
-
We got an answer from JohnMu, Webmaster Trends Analyst at Google. The reason for the crawling is (as we found out) the filters, which allow practically infinite variations (one of our developers was asleep at the wheel); we will correct this. Disallowing the filter parameters in robots.txt was advised as the quickest fix to stop the mega-crawling. The case will also be used for further research because of the disproportionate capacity usage. You're right that Google will initially crawl everything, but they don't want Googlebot's crawling to look like a "mini DDoS attack".
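For anyone running into the same thing, the quick fix looks roughly like the robots.txt sketch below. The path and parameter names are taken from our own log line above, so treat them as an example and adjust them to your own filter set:

    User-agent: *
    # Keep the plain category page crawlable, but block every filtered variation
    Disallow: /collectie?
    Disallow: /collectie/?
    # If the same filter parameters can appear on other paths as well
    Disallow: /*?*model=
    Disallow: /*?*ontwerper=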
-
Glad to help!
The large volume could well have to do with the way the filters are set up. There is also a possibility that you are sending some sort of authority signal to Google, for instance if the site shares a Search Console account or WHOIS information with other valued brands.
My gut feeling is that the traffic will drop after the initial crawl. If it doesn't, it probably means Google keeps finding something new to index, maybe dynamically created pages?
-
Thanks for your help!
I think you're probably right. The initial crawl has to be complete before Google can put everything into the right perspective. But we manage and host more than 300 sites, including large A-brand sites, and even on those sites I have never seen these kinds of volumes before.
The server logs show the same number of requests again tonight (day five). I will keep you posted if this continues after the weekend.
-
As far as I know, Google will attempt to crawl every single page it can possibly find, regardless of authority. The crawl frequency after the initial crawl will be affected by the site's authority and the volume and frequency of updates.
Virtually every publicly accessible page on every website will be indexed and will rank somewhere; where you rank is determined by Google's ranking factors.
Keep in mind that Search Console stats will be a few days out of date (two or three days), and it will normally take two or three days for a crawl to complete.
-
Mmm, is that correct? I thought the amount of resources Google puts into crawling your (new) website also depends on its authority. 9 million URLs, for four days now... It seems like an awful lot for such a small website...
-
I would say your filters are creating pages in their own right, or at least that is how Googlebot sees it. I have seen a similar thing happen on a site redesign. Potentially, if each filter combination can be reached via its own URL, every one of those URLs can be indexed as an individual page, assuming the content is different.
The first time Google crawls your site, it will try to find everything it possibly can to put in the index; Google will eat data like there's no tomorrow.
At this stage I wouldn't be too worried about it, just keep an eye out for duplicate content. I expect you'll see both graphs dip back down to normal levels within a few days.