Crawl Errors Confusing Me

mjtaylor

The SEOMoz crawl tool is telling me that I have a slew of crawl errors on the blog of one domain. All of them are related to MSNbot, and they involve trackbacks (which we do want to block, right?) and attachments (makes sense to block those, too). Any idea why these are crawl issues with MSNbot and not Google? My robots.txt is here: http://www.wevegotthekeys.com/robots.txt.

Thanks, MJ

Cyrus-Shepard

I'm a little late to the party, but I want to summarize what I see as the answer.

1. The "Search Engine Blocked by Robots.txt" is only a warning, and not an error. If you intend for these pages not to get crawled (and it does seem like you have a good reason for this), then there is nothing to worry about.

2. The reason the warning appears for MSNbot and not Google is that your robots.txt currently allows Google to crawl those files. As Daniel pointed out, you would need to give Googlebot the identical Disallow directives if you want it blocked from those pages too. Does that make sense? Alternatively, you could put all of these rules under User-agent: * so that they apply to all robots.
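One rough way to see the difference is to feed both variants of the file to a parser. The sketch below uses Python's standard urllib.robotparser purely as a stand-in: it is not what MSNbot, Googlebot, or the SEOmoz crawler actually run, and it only does plain prefix matching with no wildcard support, so the sample keeps just the /key-west-blog/tag/ rule. The names CURRENT, SHARED, blocked, and the sample URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shape of the current file: Noindex lines for Googlebot, Disallow lines for Msnbot.
CURRENT = """\
User-agent: Googlebot
Noindex: /key-west-blog/tag/

User-agent: Msnbot
Disallow: /key-west-blog/tag/
"""

# Alternative: a single group that applies to every robot.
SHARED = """\
User-agent: *
Disallow: /key-west-blog/tag/
"""

# Hypothetical page under the blocked directory.
URL = "http://www.wevegotthekeys.com/key-west-blog/tag/example/"


def blocked(robots_txt, agent, url):
    """Return True if this parser considers `url` off limits for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, text in (("current", CURRENT), ("shared", SHARED)):
        for agent in ("Googlebot", "msnbot"):
            state = "blocked" if blocked(text, agent, URL) else "allowed"
            print(name, agent, state)
```

With this parser, the current-shaped file blocks msnbot only, because the unrecognized Noindex lines are skipped, while the shared group blocks both agents. Keep in mind that a crawler which finds a group naming it specifically is generally expected to use that group instead of the * group, so the rules should live in one place or the other rather than being split between them.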

mjtaylor

Yes, I thought that's what you meant ... thanks!

DanDeceuster

I am saying this:

User-agent: Googlebot
Noindex: /key-west-blog/*?*
Noindex: /key-west-blog/*.rss
Noindex: /key-west-blog/*feed
Noindex: /key-west-blog/*trackback
Noindex: /key-west-blog/*wp-
Noindex: /key-west-blog/tag/
Noindex: /key-west-blog/search/
Noindex: /key-west-blog/archives/
Noindex: /key-west-blog/category/
Noindex: /key-west-blog/2009
Noindex: /key-west-blog/2010

and this:

User-agent: Googlebot-Mobile
Noindex: /key-west-blog/?
Noindex: /key-west-blog/*.rss
Noindex: /key-west-blog/*feed
Noindex: /key-west-blog/*trackback
Noindex: /key-west-blog/*wp-
Noindex: /key-west-blog/tag/
Noindex: /key-west-blog/search/
Noindex: /key-west-blog/archives/
Noindex: /key-west-blog/category/
Noindex: /key-west-blog/2009
Noindex: /key-west-blog/2010


Both groups use Noindex, which is a syntax I am unfamiliar with in robots.txt. You can check out http://www.robotstxt.org/robotstxt.html for more info on robots.txt and proper syntax. I would change Noindex: to Disallow: and that should fix the error in the robots.txt file.
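If there are many such lines, the rename can be scripted. This is a minimal sketch, assuming the file text has already been loaded into a string; the function name noindex_to_disallow is hypothetical, and the script only renames the field, it does not check whether the paths themselves are what you want blocked.

```python
import re

# A line whose field name is "Noindex" (any letter case), allowing leading
# spaces or tabs. Anchoring at the line start leaves paths that merely
# contain the word "noindex" untouched.
NOINDEX_FIELD = re.compile(r"^([ \t]*)noindex([ \t]*:)", re.IGNORECASE | re.MULTILINE)


def noindex_to_disallow(robots_txt):
    """Return the text with every 'Noindex:' field renamed to 'Disallow:'."""
    return NOINDEX_FIELD.sub(r"\1Disallow\2", robots_txt)


if __name__ == "__main__":
    sample = (
        "User-agent: Googlebot\n"
        "Noindex: /key-west-blog/*feed\n"
        "Noindex: /key-west-blog/tag/\n"
    )
    print(noindex_to_disallow(sample))
```

It is worth reading the output over before uploading it, since a Disallow rule stops crawling of anything that matches it.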

mjtaylor

The robots.txt file DOES contain

User-agent: Msnbot
Crawl-delay: 120
Disallow: /key-west-blog/*?*
Disallow: /key-west-blog/*.rss
Disallow: /key-west-blog/*feed
Disallow: /key-west-blog/*trackback
Disallow: /key-west-blog/*wp-
Disallow: /key-west-blog/*login.php
Disallow: /key-west-blog/tag/
Disallow: /key-west-blog/search/
Disallow: /key-west-blog/archives/
Disallow: /key-west-blog/category/
Disallow: /key-west-blog/2009
Disallow: /key-west-blog/2010

But you are saying I should remove the lines with noindex?

DanDeceuster

In your robots.txt file, you have the Disallow: directive under MSNbot and Noindex: under Googlebot. Noindex is not part of the robots.txt standard. Change Noindex: to Disallow: and those pages will be blocked for those bots too. Not sure if that is what is causing the issue, but it would explain the discrepancy. If you want to noindex a page, you do it with a meta tag like this:

<meta name="robots" content="noindex, follow">

You can change follow to nofollow if you want; it really doesn't matter much.

ENSO

I have the same problem: it looks like MSNbot is disallowed from accessing WordPress content. My pages show up as ?page=111, so from what I understand so far, any URL that matches the rule below is blocked for MSNbot. I don't have a definite answer for you as to what to do, but I can tell you that you will need to allow MSNbot the same access that Googlebot has.

Disallow: /key-west-blog/*?*
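To check which URLs a wildcard rule like the one above would catch, a small matcher can help. This is only a sketch under the commonly documented wildcard extension, where * stands for any run of characters and a trailing $ anchors the end of the URL; individual crawlers may interpret patterns differently. The function names and the test paths are hypothetical.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes '*' matches any run of characters and a trailing '$'
    anchors the end of the URL; everything else is literal.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def rule_matches(rule_path, url_path):
    """True if the pattern matches from the start of the path plus query string."""
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    rule = "/key-west-blog/*?*"
    for path in ("/key-west-blog/?page=111", "/key-west-blog/some-post/"):
        print(path, rule_matches(rule, path))
```

Under these assumptions the rule catches any blog URL that carries a query string, which would include ?page=111 style WordPress URLs, while a plain permalink without a question mark is left alone.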
