Rogerbot Ignoring Robots.txt?
-
Hi guys,
We're trying to block Rogerbot from spending 8000-9000 of our 10000 pages per week for our site crawl on our zillions of PhotoGallery.asp pages. Unfortunately our e-commerce CMS isn't tremendously flexible so the only way we believe we can block rogerbot is in our robots.txt file.
Rogerbot keeps crawling all these PhotoGallery.asp pages so it's making our crawl diagnostics really useless.
I've contacted the SEOMoz support staff and they claim the problem is on our side. This is the robots.txt we are using:
User-agent: rogerbot
Disallow:/PhotoGallery.asp
Disallow:/pindex.asp
Disallow:/help.asp
Disallow:/kb.asp
Disallow:/ReviewNew.asp
User-agent: *
Disallow:/cgi-bin/
Disallow:/myaccount.asp
Disallow:/WishList.asp
Disallow:/CFreeDiamondSearch.asp
Disallow:/DiamondDetails.asp
Disallow:/ShoppingCart.asp
Disallow:/one-page-checkout.asp
Sitemap: http://store.jrdunn.com/sitemap.xml
For some reason the Wysiwyg edit is entering extra spaces but those are all single spaced.
Any suggestions? The only other thing I thought of to try is to something like "Disallow:/PhotoGallery.asp*" with a wildcard.
-
I have just encountered an interesting thing about Moz Link Search and its bot: if you do a search for Domains linking to Google.com , you find a list of about 900 000 domains, among which I was surprised to find webcache.googleusercontent.com
See the proof below in attache screen shot.
At the same time, the webcache.googleusercontent.com policy for robots is as shown in the second attachment.
In my opinion, there is only one possible explanation: Moz Bot does ignore robots.txt files...
-
Thanks Cyrus,
No, for some reason the editor double-spaced the file when I pasted. Other than that, it's the same though.
Yes, I actually tried ordering the exclusions both ways. Neither works.
The robots.txt checkers report no errors. I had actually checked them before posting.
Before I posted this, I was pretty convinced the problem wasn't in our robots.txt but the Seomoz support staff says essentially, "We don't think the problem is with Rogerbot, so it must be in your robots.txt file, but we can't look at that, so if by some chance your robots.txt file is fine, then there's nothing we can do for you because we're just going to assume the problem is on your side."
I figured, with everything I've already tried, and if the fabulous SEOMoz community can't come up with a solution, that'll be the best I can do.
-
Hi Kelly,
Thanks for letting us know. Could be a couple of things right off the bat. Is this your exact robots.txt file? If so, it's missing some formatting like proper spacing to be perfectly compliant. You can run a check of your robots.txt file at serveral places.
http://tool.motoricerca.info/robots-checker.phtml
http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php
http://www.sxw.org.uk/computing/robots/check.html
Also, it's generally a good idea to put specific inclusions towards the bottom, so I might flip the order and put the rogerbot directives last and the User-agent: * first.
Hope this helps. Let us know if any of this points in the right direction.
-
Thanks so much for the tip. Unfortunately still unsuccessful. (shrug)
-
Try
Disallow: /PhotoGallery.asp
I put wild cards all over usually just to be sure and had no issues so far.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Block Moz (or any other robot) from crawling pages with specific URLs
Hello! Moz reports that my site has around 380 duplicate page content. Most of them come from dynamic generated URLs that have some specific parameters. I have sorted this out for Google in webmaster tools (the new Google Search Console) by blocking the pages with these parameters. However, Moz is still reporting the same amount of duplicate content pages and, to stop it, I know I must use robots.txt. The trick is that, I don't want to block every page, but just the pages with specific parameters. I want to do this because among these 380 pages there are some other pages with no parameters (or different parameters) that I need to take care of. Basically, I need to clean this list to be able to use the feature properly in the future. I have read through Moz forums and found a few topics related to this, but there is no clear answer on how to block only pages with specific URLs. Therefore, I have done my research and come up with these lines for robots.txt: User-agent: dotbot
Moz Pro | | Blacktie
Disallow: /*numberOfStars=0 User-agent: rogerbot
Disallow: /*numberOfStars=0 My questions: 1. Are the above lines correct and would block Moz (dotbot and rogerbot) from crawling only pages that have numberOfStars=0 parameter in their URLs, leaving other pages intact? 2. Do I need to have an empty line between the two groups? (I mean between "Disallow: /*numberOfStars=0" and "User-agent: rogerbot")? (or does it even matter?) I think this would help many people as there is no clear answer on how to block crawling only pages with specific URLs. Moreover, this should be valid for any robot out there. Thank you for your help!0 -
Huge spike in crawl errors today - mozbot ignoring noindex tag?
Hi Mozzers, Today I received a ton of errors and warnings in my weekly crawl due to the mozbot crawling my noindex'd search results pages, such as this - http://www.consumerbase.com/Mailing-Lists.html?q=Construction&type=bus&channel=all&page=7&order=title&orderBy=DESC See image: http://screencast.com/t/qaZzq78j2Udx Anyone else seen a similar error this week with their crawl? Thanks!
Moz Pro | | Travis-W0 -
The pages that add robots as noindex will Crawl and marked as duplicate page content on seo moz ?
When we marked a page as noindex with robots like {<meta name="<a class="attribute-value">robots</a>" content="<a class="attribute-value">noindex</a>" />} will crawl and marked as duplicate page content(Its already a duplicate page content within the site. ie, Two links pointing to the same page).So we are mentioning both the links no need to index on SE.But after we made this and crawl reports have no change like it tooks the duplicate with noindex marked pages too. Please help to solve this problem.
Moz Pro | | trixmediainc0 -
How do you tell SEOmoz to ignore a subdomain?
We have a subdomain that I don't want to show up in our root SEOmoz campaign. How do I tell SEOmoz to ignore it?
Moz Pro | | wcsjohn0 -
How to get rogerbot whitelisted for application firewalls.
we have recently installed an application firewall that is blocking rogerbot from crawling our site. Our IT department has asked for an IP address or range of IP addresses to add to the acceptable crawlers. If rogerbot has a dynamic IP address how to we get him added to our whitelist? The product IT is using is from F5 called Application Security Manager.
Moz Pro | | Shawn_Huber0 -
How to get rid of the message "Search Engine blocked by robots.txt"
During the Crawl Diagnostics of my website,I got a message Search Engine blocked by robots.txt under Most common errors & warnings.Please let me know the procedure by which the SEOmoz PRO Crawler can completely crawl my website?Awaiting your reply at the earliest. Regards, Prashakth Kamath
Moz Pro | | 1prashakth0 -
What is the full User Agent of Rogerbot?
What's the exact string that Rogerbot send out as his UserAgent within the HTTP Request? Does it ever differ?
Moz Pro | | rightmove0 -
Is there a whitelist of the RogerBot IP Addresses?
I'm all for letting Roger crawl my site, but it's not uncommon for malicious spiders to spoof the User-Agent string. Having a whitelist of Roger's IP addresses would be immensely useful!
Moz Pro | | EricCholis1