How can I best find out which URLs from large sitemaps aren't indexed?
-
I have about a dozen sitemaps with a total of just over 300,000 URLs in them. These have been carefully built to include only content that I feel is above a certain quality threshold.
However, Google says it has indexed only 230,000 of these URLs. How can I best work out which URLs haven't been indexed? No errors are showing in Webmaster Tools (WMT) for these pages.
I can obviously start checking URLs manually, but surely there's a better way?
-
There's no obvious function for this in Webmaster Tools, but after having a look around I found this option:
http://www.aspfree.com/c/a/BrainDump/Extracting-Google-Indexed-Web-Site-Pages-Using-MS-Excel/
However, Google only displays the first 1000 URLs for a site: query, so you would need to adapt it and rerun it many times. From the looks of it, there isn't an easy way.
There may be a tool out there that is similar to Xenu but also checks the index status in Google. I've never needed one, so I'm not aware of any, but chances are something exists.
Good luck!
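One way to "adapt it lots of times" might be to work section by section, so that each site: query stays under the 1000-result display limit mentioned above. A minimal sketch, assuming the sitemap URLs are already loaded into a Python list; the function names `count_by_section` and `sections_over_limit` are hypothetical helpers, not part of any Google tool:

```python
from collections import Counter
from urllib.parse import urlparse

RESULT_LIMIT = 1000  # the per-query display limit mentioned in the post above


def count_by_section(urls):
    """Count URLs per first path segment, e.g. '/products/'."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # Pages directly under the site root are grouped under "/".
        section = "/" + segments[0] + "/" if len(segments) > 1 else "/"
        counts[section] += 1
    return counts


def sections_over_limit(urls, limit=RESULT_LIMIT):
    """Return the sections holding more URLs than one query can display."""
    return {s: n for s, n in count_by_section(urls).items() if n > limit}
```

Any section returned by `sections_over_limit` would need to be broken down further (for example by a second path segment) before a single query could list it in full.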
-
Any ideas on how to go about exporting the indexed URLs?
-
Hi Peter,
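I'd attempt some sort of export of both the indexed URLs and the actual sitemap URLs into an Excel file, then remove the duplicates. Whatever is left over from the sitemap list should be the URLs that aren't indexed.
You would need to look into it, but I'm sure there's a way of matching the two lists and removing the duplicates.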
Other than that I wouldn't know.
Ben
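The matching step Ben describes can also be done outside Excel. A minimal sketch in Python, assuming you have the sitemap XML as text and have somehow obtained a list of indexed URLs (the thread does not establish how to export that list, so it is an assumption here); `urls_from_sitemap` and `not_indexed` are hypothetical names:

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return the set of <loc> values found in one sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def not_indexed(sitemap_urls, indexed_urls):
    """URLs present in the sitemaps but missing from the indexed list."""
    return sorted(set(sitemap_urls) - set(indexed_urls))
```

With a dozen sitemaps you would call `urls_from_sitemap` once per file, union the results, and pass the combined set to `not_indexed`. The comparison is an exact string match, so differences such as a trailing slash or http versus https would need to be normalized first.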