Rogerbot getting cheeky?
-
Hi SEOmoz,
From time to time my server crashes during Rogerbot's crawling escapades, even though I have a robots.txt file with a crawl-delay of 10, which I've now increased to 20.
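For reference, the relevant robots.txt lines look something like this (a minimal sketch; "rogerbot" is the user-agent token Moz documents for its crawler, and whether the rule targets it specifically or all agents is up to you):

User-agent: rogerbot
Crawl-delay: 20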
I looked at the Apache log and noticed Roger hitting me from 4 different addresses: 216.244.72.3, 216.244.72.11, 216.244.72.12 and 216.176.191.201. Most of the time each individual address did keep its own requests 10 seconds apart, but all 4 addresses would hit 4 different pages simultaneously (example 2). At other times it wasn't respecting robots.txt at all (see example 1 below).
I wouldn't call this 'respecting the crawl-delay entry in robots.txt', which is what other questions answered here by you have claimed. 4 simultaneous page requests within 1 second from Rogerbot is not what should be happening, IMHO.
example 1
216.244.72.12 - - [05/Sep/2012:15:54:27 +1000] "GET /store/product-info.php?mypage1.html HTTP/1.1" 200 77813
216.244.72.12 - - [05/Sep/2012:15:54:27 +1000] "GET /store/product-info.php?mypage2.html HTTP/1.1" 200 74058
216.244.72.12 - - [05/Sep/2012:15:54:28 +1000] "GET /store/product-info.php?mypage3.html HTTP/1.1" 200 69772
216.244.72.12 - - [05/Sep/2012:15:54:37 +1000] "GET /store/product-info.php?mypage4.html HTTP/1.1" 200 82441
example 2
216.244.72.12 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage1.html HTTP/1.1" 200 70209
216.244.72.11 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage2.html HTTP/1.1" 200 82384
216.244.72.12 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage3.html HTTP/1.1" 200 83683
216.244.72.3 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage4.html HTTP/1.1" 200 82431
216.244.72.3 - - [05/Sep/2012:15:46:16 +1000] "GET /store/mypage5.html HTTP/1.1" 200 82855
216.176.191.201 - - [05/Sep/2012:15:46:26 +1000] "GET /store/mypage6.html HTTP/1.1" 200 75659
Please advise.
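(For anyone who wants to check their own logs the same way, here's a quick Python 3 sketch that measures the gap between consecutive requests, both per crawler IP and combined across all of them. It assumes the standard Apache common/combined log format, and that access.log has already been filtered down to the crawler's entries, e.g. with grep.)

import re
from collections import defaultdict
from datetime import datetime

# Matches the client IP and timestamp of a standard Apache access-log line.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def request_times(path):
    """Map each client IP to its list of request timestamps."""
    times = defaultdict(list)
    with open(path) as log:
        for line in log:
            m = LINE_RE.match(line)
            if m:
                ip, stamp = m.groups()
                # Apache timestamps look like 05/Sep/2012:15:46:15 +1000
                times[ip].append(datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z'))
    return times

def gaps(timestamps):
    """Seconds between consecutive requests."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

times = request_times('access.log')  # assumed pre-filtered to Rogerbot entries
for ip, ts in sorted(times.items()):
    g = gaps(ts)
    if g:
        print(f'{ip}: {len(ts)} requests, shortest gap {min(g):.0f}s')
# The server feels the combined rate across all crawler IPs, not the per-IP rate:
combined = gaps([t for ts in times.values() for t in ts])
if combined:
    print(f'all IPs combined: shortest gap {min(combined):.0f}s')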
-
Hi BM7,
I'm going to open up a ticket on this to have our engineers take a closer look at your site. Once we have an overall response, I'll post it here for other community members to view.
Cheers!
-
Thanks, Megan, for your reply.
I'll give that a try, and I have also blocked 2 of the addresses so you are reduced to 2 crawler sessions. These two measures should reduce the load considerably, as long as Rogerbot respects the 7-second delay.
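For anyone wanting to do the same, blocking at the Apache level looks roughly like this (a sketch in Apache 2.2 .htaccess syntax; which two of the four addresses to list is a guess on my part, and Apache 2.4 would use "Require not ip ..." instead):

# Deny two of the crawler's addresses while allowing everyone else
Order Allow,Deny
Allow from all
Deny from 216.244.72.3
Deny from 216.176.191.201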
IMHO, ignoring the Crawl-delay set by the webmaster of the site you are crawling, which crawlers are supposed to respect, is wrong. I got a Google WMT nasty for being down 5 hours due to Rogerbot: it took the server down in the middle of the night, so it only got restarted in the morning.
Also, my site has around 600 discrete pages, of which you crawl about 500, so even at the original 10-second crawl delay you could do my whole site in under 1.5 hours (500 pages × 10 seconds ≈ 83 minutes), and the crawl is only required once a week. So in my mind that suggests there is no need to overrule my settings in robots.txt 'so he (Roger) can complete the crawl'.
Regards,
-
Hi there,
This is Megan from the SEOmoz Help Team. I'm so sorry Rogerbot is causing you grief! This might actually be happening because your crawl delay is too long, so Rogerbot ends up ignoring it in order to complete the crawl. If you set your crawl delay to a maximum of 7, it should solve your problem. If you're still running into issues, though, please send us a message at help@seomoz.org and we'll check it out ASAP!
Cheers!
Related Questions
-
Unable to get into top 20 even when pages are optimized and most crawl issues resolved
I have a few keyword phrases I've been trying to rank in the top 20 for (a starting place). I have optimized for a few different phrases, ranging in keyword difficulty, but no matter what I do I can't seem to get in. In many cases, the exact same results show up for many different variations of the phrases I'd like to rank for. I've read about how Google tries to match user intent, so if it decides those results are more relevant then it will always show them, but does that mean that no matter what I do I will always be behind them? The main question I have is: how should I proceed? Should I stop optimizing pages and focus on link acquisition? Or go through and make sure there isn't a single crawl issue? Or focus on optimizing for longer-tail keyword phrases? It just feels like I've done so much of what the Moz tools have recommended and I'm seeing very little movement over the past couple of months; if anything, I see dips in performance after optimization. Thanks in advance!
Moz Pro | Dynata_panel_marketing
-
I keep getting Authentication Failed on the API
I have the credentials in the URL correctly, but it continues to fail authentication. I won't post them, obviously, but is there a problem with the API currently? I tried creating new credentials. Also, I have used this before, so I am sure it is not a problem with the credentials. I somehow managed to get Chrome to show the data; Firefox will not, and the code I have written also returns authentication failed. This is a bug on your end. Please fix it ASAP.
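(A common cause of authentication failures with this API was a malformed signature. As a point of comparison, here's a minimal Python 3 sketch of the signed-authentication scheme as documented for the Linkscape/Mozscape API of that era: HMAC-SHA1 over "AccessID\nExpires", base64-encoded, then URL-encoded. The endpoint, target URL, and credentials below are placeholders, not verified values.)

import base64
import hashlib
import hmac
import time
from urllib.parse import quote

ACCESS_ID = 'member-xxxxxxxxxx'   # placeholder
SECRET_KEY = 'your-secret-key'    # placeholder

def signed_url_metrics(target):
    """Build a signed url-metrics request URL (signature valid for 5 minutes)."""
    expires = int(time.time()) + 300
    digest = hmac.new(SECRET_KEY.encode(),
                      f'{ACCESS_ID}\n{expires}'.encode(), hashlib.sha1).digest()
    signature = quote(base64.b64encode(digest), safe='')
    return (f'http://lsapi.seomoz.com/linkscape/url-metrics/{quote(target, safe="")}'
            f'?AccessID={ACCESS_ID}&Expires={expires}&Signature={signature}')

print(signed_url_metrics('www.example.com/'))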
Moz Pro | ColumK
-
Getting rid of duplicate content
Hi everyone, I'm a newbie and at the moment don't know very much about SEO. I have a problem with some of my campaigns where I keep getting a report with either Duplicate Page and/or Duplicate Content errors. I have no idea how to rectify, remove, or fix this error on the relevant websites. Can anyone please help explain how to do this, maybe step by step? I really appreciate your views and opinions! Regards, Hugh
Moz Pro | DigitalAcademyZA
-
Getting relevant keywords from URL with Google KW Tool.
Hi, When I first start researching a site, I like to see what Google "thinks" it is relevant to. I use the Google KW Tool and enter the website URL only, then sort the results by relevance. I can then show the prospective client what Google thinks his site is optimized for and use that info to show him what opportunities exist to rank for terms more relevant to his business. I show him keyword, volume, and current SERP rank for his site. For larger sites, I do this for the top pages based on Domain Authority. I want to automate this process using Excel and APIs, but Google refused my API token request. I told them I wanted to use the "Google AdWords API Extension for Excel" from http://seogadget.co.uk/google-adwords-plugin-excel. The Google API token team replied: "Please note, after reviewing your application in detail, we are sorry to let you know that we won't be able to approve your token. We understand that you are planning to use the AdWords API mainly for Targeting Idea Service (TIS) and Traffic Estimation Service (TES) such as 'keyword research'. Please note that as per the Required Minimum Functionality (RMF) outlined in the API Terms & Conditions, using the AdWords API exclusively for TIS and TES type of services is not allowed." Q1: What does the KW Tool relevancy data mean, anyway? Q2: Is there another way to get it, or another way to do this? Q3: Is there a better approach I should take with the Google API team? Q4: Are there other APIs and Excel plugins that can do this, including the SEOMoz APIs? Thanks,
Phil
Moz Pro | phersh
-
Is the Linkscape Index getting updated?
I know it only gets a big refresh once a month, the last being in early January. But I released a map that went viral (front page of CNET, Wired, Scientific American) and it has gotten me hundreds of links from different domains: https://www.google.com/search?q=perceptionbuilder.com+map I got these links in early-to-mid December and OSE is showing no data at all. Similarly, I have been doing some link building for clients, and links that I got in October and November aren't showing up either. Anyone else experiencing this? Many thanks, Matt
Moz Pro | coppersix
-
RogerBot does not respect some rules??
Hello! Every week when I see my stats I notice that RogerBot has crawled 10,000 pages from my website, including pages with a noindex or that are not allowed in the robots.txt. Is it possible to keep it from crawling these pages? They are form pages on my site, which are not indexed by Google: they have a noindex and they are disallowed for crawling in the robots.txt. Thanks everyone for your help!!!
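(For reference, a typical setup along these lines, with a hypothetical path:)

User-agent: *
Disallow: /forms/

plus, in each page's HTML head:

<meta name="robots" content="noindex">

One caveat worth knowing: a crawler that obeys the Disallow never fetches the page, so it never sees the noindex. Seeing these URLs in crawl stats therefore usually means the robots.txt rule isn't matching the crawler's user-agent or the paths as expected.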
Moz Pro | jgomes
-
Getting SEOMoz reports to ignore certain parameters
I want the SEOMoz reports to ignore duplicate content caused by link-specific parameters being added to URLs (the same page is reachable from different pages, with marker parameters identifying the source page appended to the URL). I can get Google and Bing webmaster tools to ignore parameters I specify; I need to get the SEOMoz tools to do it as well!
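(In the meantime, a common workaround is to declare a canonical, parameter-free URL on each affected page so duplicate-content checks can fold the parameterized variants together; the URL here is hypothetical:)

<link rel="canonical" href="http://www.example.com/store/mypage1.html">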
Moz Pro | SEO-Enlighten