Google: How to See URLs Blocked by Robots?
-
Google Webmaster Tools says we have 17K out of 34K URLs that are blocked by our Robots.txt file.
How can I see the URLs that are being blocked?
Here's our Robots.txt file.
User-agent: *
Disallow: /swish.cgi
Disallow: /demo
Disallow: /reviews/review.php/new/
Disallow: /cgi-audiobooksonline/sb/order.cgi
Disallow: /cgi-audiobooksonline/sb/productsearch.cgi
Disallow: /cgi-audiobooksonline/sb/billing.cgi
Disallow: /cgi-audiobooksonline/sb/inv.cgi
Disallow: /cgi-audiobooksonline/sb/new_options.cgi
Disallow: /cgi-audiobooksonline/sb/registration.cgi
Disallow: /cgi-audiobooksonline/sb/tellfriend.cgi
Disallow: /*?gdftrk
-
It seems you might be asking two different questions here, Larry.
You ask which URLs are blocked by your robots.txt file, then answer your own question by listing the file's entries; those entries are exactly the URLs and URL patterns it blocks.
If in fact what you want to know is which pages exist on your website but are not currently indexed, that's a much bigger question and requires a lot more work to answer.
There is no way Webmaster Tools can give you that answer, because if it were aware of a URL, it would already be indexing it.
HOWEVER! It is possible if you're willing to do some of the work yourself, collecting and manipulating data with several tools. Essentially, you have to do it in three steps:
- create a list of all the URLs that Google says are indexed. (This info comes from Google's SERPs.)
- then create a separate list of all of the URLs that actually exist on your website. (This must come from a 3rd-party tool you run against your site yourself.)
- From there, use Excel (or a short script; see the sketch after this list) to subtract the indexed URLs from the known URLs, leaving a list of non-indexed URLs, which is what you asked for.
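If you'd rather script that subtraction than do it in Excel, here's a minimal Python sketch. It assumes you've exported both lists to plain-text files with one URL per line; the file names indexed.txt and crawled.txt are placeholders for whatever your exports are called:

```python
def load_urls(path):
    # Read one URL per line, ignoring blank lines.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

indexed = load_urls("indexed.txt")   # URLs collected from Google's SERPs
crawled = load_urls("crawled.txt")   # URLs found by your own crawl

# Set difference: everything on the site that Google hasn't indexed.
for url in sorted(crawled - indexed):
    print(url)
```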
I actually laid out this process step by step in response to an earlier question, so you can read the full walkthrough here: http://www.seomoz.org/q/how-to-determine-which-pages-are-not-indexed
Is that what you were looking for?
Paul
-
Okay, well the robots.txt will only exclude robots from the folders and URLs specified, and, as I say, there's no way to download a list of all the URLs that Google is not indexing from Webmaster Tools.
If you have exact URLs in mind which you think might be getting excluded, you can test individual URLs in Google Webmaster Tools in:
Health > Blocked URLs
Then specify the URLs and user-agents to test against.
Beyond this, if you want to know whether there are URLs that shouldn't be excluded inside the folders you've specified, I would run a crawl of your website using SEOmoz's crawl test or Screaming Frog. Then sort the URLs alphabetically and make sure that every URL in the folders you've excluded via robots.txt is one you actually want to exclude.
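If the crawl comes back with thousands of URLs, you can automate that check instead of eyeballing the sorted list. Here's a rough Python sketch using the standard library's robotparser, assuming the crawl is exported as crawled.txt with one URL per line:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.audiobooksonline.com/robots.txt")
rp.read()  # fetches and parses the live robots.txt

with open("crawled.txt") as f:
    for line in f:
        url = line.strip()
        if url and not rp.can_fetch("Googlebot", url):
            print("blocked:", url)
```

One caveat: Python's parser treats Disallow values as plain path prefixes, so a wildcard rule like Disallow: /*?gdftrk won't be matched the way Googlebot matches it. Verify those patterns in the Webmaster Tools blocked-URLs tester itself.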
-
I want to make sure that Google is indexing all of the pages we want it to, i.e., that all of the NOT-indexed URLs are ones we actually want excluded.
-
Hi Larry
Just for my understanding, why do you want to find those URLs? Are you concerned that the robots.txt is blocking URLs it shouldn't be?
As for downloading a list of URLs which aren't indexed from Google Webmaster Tools, which is what I think you would really like, this isn't possible at the moment.
-
Liz: Perhaps my post was unclear, or I am misunderstanding your answer.
I want to find out the specific URLs that Google says it isn't indexing because of our Robots.txt file.
-
If you want to see if Google has indexed individual pages which are supposed to be excluded, you can check the URLs in your robots.txt using the site: command.
E.g. type the following into Google:
site:http://www.audiobooksonline.com/swish.cgi
site:http://www.audiobooksonline.com/reviews/review.php/new/
...continue for all the URLs in your robots.txt.
Just from searching on the last example above (site:http://www.audiobooksonline.com/reviews/review.php/new/) I can see that you have results indexed. This is probably because the robots.txt rule was added after those pages were already indexed.
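To save typing those queries out one by one, here's a small sketch that prints one site: query per Disallow line, assuming you've saved a local copy of the file as robots.txt. It skips wildcard rules, since the site: operator can't express them:

```python
host = "http://www.audiobooksonline.com"
with open("robots.txt") as f:
    for line in f:
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path and "*" not in path:
                # site: accepts a URL prefix, so host + path covers plain rules
                print("site:" + host + path)
```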
To get rid of these results you need to: take the culprit line out of the robots.txt (it has to come out first so Googlebot can recrawl the pages and actually see the noindex tag), add the robots meta tag set to noindex to all pages you want removed, submit a URL removal request via Webmaster Tools, check that the pages have been deindexed, and then add the line back into the robots.txt.
This is the tag:
<meta name="robots" content="noindex">
I hope that makes sense and is useful!