How can I best find out which URLs from large sitemaps aren't indexed?
-
I have about a dozen sitemaps containing just over 300,000 URLs in total. They were built carefully to include only the content that I feel is above a certain threshold.
However, Google says it has indexed only 230,000 of these URLs. How can I best work out which URLs haven't been indexed? WMT isn't showing any errors for these pages.
I can obviously start checking them manually, but surely there's a better way?
-
There's no obvious function for this in Webmaster Tools, but after having a look around, there's this option:
http://www.aspfree.com/c/a/BrainDump/Extracting-Google-Indexed-Web-Site-Pages-Using-MS-Excel/
But Google will only display the first 1,000 URLs for a site: query, so you would need to adapt it and run it many times. From the looks of it, there isn't an easy way.
There may be a tool out there that is similar to Xenu but also checks index status in Google. I've never needed one, so I'm not aware of any, but chances are something exists.
Good luck!
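One way to "adapt it lots of times" is to split the site into sections and run one site: query per section, so each query stays under the 1,000-result display limit. The sketch below is only an illustration of that idea, not a tested tool: the function names `site_queries` and `oversized` are made up for this example, and it assumes you already have the sitemap URLs as a plain list of strings.

```python
from collections import Counter
from urllib.parse import urlsplit

RESULT_CAP = 1000  # the per-query display limit mentioned above


def site_queries(urls, depth=1):
    """Group URLs by their first `depth` path segments.

    Returns a dict mapping a site: query string to the number of
    sitemap URLs that fall under that prefix.
    """
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s][:depth]
        prefix = parts.netloc + "/" + "/".join(segments)
        counts[prefix] += 1
    return {f"site:{prefix}": n for prefix, n in counts.items()}


def oversized(queries, cap=RESULT_CAP):
    """List the queries whose section holds more URLs than the cap."""
    return sorted(q for q, n in queries.items() if n > cap)
```

Any query returned by `oversized` covers a section too large to inspect in one go, so it would need splitting again with a larger `depth`. With 300,000 URLs that could still mean a lot of queries, which is why this is not an easy route.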
-
Any ideas on how to go about exporting the indexed URLs?
-
Hi Peter,
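I'd try exporting both the indexed URLs and the actual sitemap URLs into an Excel file and then removing the duplicates. Whatever is left from the sitemap list would be the URLs that aren't indexed.
You would need to look into the details, but I'm sure there's a way of matching the two lists and removing the duplicates.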
Other than that I wouldn't know.
Ben
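The match-and-remove-duplicates step does not have to happen in Excel. Here is a minimal Python sketch of the same comparison, under some assumptions: the sitemaps follow the standard sitemaps.org XML format, the indexed URLs are already available as a list of strings (how to export them is the open question in this thread), and the function names `sitemap_urls` and `not_indexed` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in one sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def not_indexed(sitemap_xmls, indexed_urls):
    """Return the sitemap URLs that are missing from the indexed list."""
    submitted = set()
    for xml_text in sitemap_xmls:
        submitted |= sitemap_urls(xml_text)
    # Ignore blank lines and stray whitespace in the indexed export
    indexed = {u.strip() for u in indexed_urls if u.strip()}
    return sorted(submitted - indexed)
```

The comparison is an exact string match, so differences such as a trailing slash or http versus https would count as "not indexed". You may want to normalise both lists first if your export formats them differently.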