How can I best find out which URLs from large sitemaps aren't indexed?

rango

I have about a dozen sitemaps with a total of just over 300,000 urls in them. These have been carefully created to only select the content that I feel is above a certain threshold.

However, Google says they have only indexed 230,000 of these urls. Now I'm wondering, how can I best go about working out which URLs they haven't indexed? No errors are showing in WMT related to these pages.

I can obviously manually start hitting it, but surely there's a better way?

Audiohype

There's no obvious function in WM tools, but having a look round there's this option:

http://www.aspfree.com/c/a/BrainDump/Extracting-Google-Indexed-Web-Site-Pages-Using-MS-Excel/

But Google will only display the first 1000 URLs on a site query so you would need to adapt it lots of times. From the looks of it there's not an easy way.

There's maybe a tool out there that is similar to Xenu, but checks the index status in Google also. I haven't ever had the need for this so I'm not aware of one, but the chances are there is something out there.

Good luck!

rango

Any ideas on how to go about exporting indexed urls?

Audiohype

Hi Peter,

I'd attempt some sort of export of both indexed URLs and actual URLs into an Excel file and try and remove duplicates.

You would need to look into it but I'm sure there's a way of matching and removing duplicates.

Other than that I wouldn't know.

Ben

Welcome to the Q&A Forum

Browse the forum for helpful insights and fresh discussions about all things SEO.

How can I best find out which URLs from large sitemaps aren't indexed?

Got a burning SEO question?

Browse Questions

Explore more categories

Related Questions

Wrong canonical URL was specified. How to refresh the index now?

Getting a Vanity (Clean) URL indexed

Can you noindex a page, but still index an image on that page?

Duplicate Content - What's the best bad idea?

Duplicate Page Content error but I can't see it

SEO url best practices

Why won't google rank my homepage

How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?