How can I get a list of every url of a site in Google's index?
-
I work on a site that has almost 20,000 urls in its site map. Google WMT claims 28,000 indexed and a search on Google shows 33,000. I'd like to find what the difference is.
Is there a way to get an excel sheet with every url Google has indexed for a site?
Thanks... Mike
-
If this is still an issue you're facing, have you checked the sitemap settings to see which page types are getting included? For example, a site with a few thousand tags that are not entered in the sitemap but not yet set to noindex could easily produce extra pages like this.
The next step is parameterization. Anything going on there with search URLs or product URLs? eg ?refid=1235134&q=search+term or ?prod=152134&variant=blue
If you really want to scrape through Google, get a list of your sitemap and scrape queries like "inurl:domain.com/a", "inurl:domain.com/b", "inurl:domain.com/c". etc. This should allow you to dive deeper into the site map to see what Google really has indexed. For URL subfolders with tons of URLs like domain.com/product/a, you'll want to do the same thing at a subfolder level instead of root URLs.
-
You can do that with a tool like Scrapebox or Outwit. Go slow, or else you'll need to use proxies to get Google to respond fast enough. As another commenter mentioned, it's probably against TOS.
-
You could probably write a macro to do this, although just because you could doesn't mean you should. I don't think it is advisable because you do not want to violate any terms of use for anyone. That is never a good thing.
-
Yes, WMT API doesn't have it. The site site:xxxx.com search is where are got one of the two too high numbers. Thanks... Mike
-
Hi Marijn,
Thanks for the suggestions. 2.5 years of G/A organic landing pages is 10,000 urls.... 1/2 as many as the site map and 1/3rd as many as Google says indexed. On scraping google, do you know of a tool for that?
Thanks... Mike
-
Might be something you can get from the WMT API.
Also, to really see how many pages are indexed, do a site:xxxx.com search, go to the last page, include omitted results, go to the last page again, and add up how many you have. That's probably the most accurate number.
-
Hi Mike,
There a couple of solutions, neither of them provide you with 100% of data. The best would be to export a list of landing pages from Google Analytics or your favorite web analytics tool segmented by organic search/ Google. This would provide you with a list of pages that received traffic via search and so are indexed. If you cross reference them with your sitemaps that might already help you out a bit. Besides that you could crawl and scrape the URLS for a site:xxx.com search.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Magento: Should we disable old URL's or delete the page altogether
Our developer tells us that we have a lot of 404 pages that are being included in our sitemap and the reason for this is because we have put 301 redirects on the old pages to new pages. We're using Magento and our current process is to simply disable, which then makes it a a 404. We then redirect this page using a 301 redirect to a new relevant page. The reason for redirecting these pages is because the old pages are still being indexed in Google. I understand 404 pages will eventually drop out of Google's index, but was wondering if we were somehow preventing them dropping out of the index by redirecting the URL's, causing the 404 pages to be added to the sitemap. My questions are: 1. Could we simply delete the entire unwanted page, so that it returns a 404 and drops out of Google's index altogether? 2. Because the 404 pages are in the sitemap, does this mean they will continue to be indexed by Google?
Intermediate & Advanced SEO | | andyheath0 -
What was your experience with changing site url's?
I work with a company that is about to move to a new platform. Because the category and page structure is different every almost every url but the home page will need to be 301 redirected. I know how to do this and am pretty sure I will find and fix 99% ahead of time and not have too many 404's showing up in webmaster tools to clean up. My question is has anyone who is reading this post had to do this before and what was your experience with organic traffic after you made the switch. I am predicting that even if I successfully redirected 100% of the url's there would be some loss for a couple of months just due to the fact that we are making a major change. My bosses are asking if there will be any loss and I need to tell them what to expect.
Intermediate & Advanced SEO | | KentH0 -
Google displaying a content box above the listing link for top ranking listing in SERPs
Hi, In the attached Google SERP example the first listing below the paid search ads has a large box with a snippet of content from the relevant page then followed by the standard link. Does anyone know how you get Google to display a box like this in their SERPs? I checked the code on the page and there doesn't appear to be anything special about it such as any schema markup. It uses standard list code. Does this only appear for particular types of content or sites, such as medical content in this case? Is the content more likely to appear for lists? Does it only appear for high authority sites that Google has selected? We have a similar medical information based site and it would be great to try to get Google to display a similar box of content for some of our pages. Thanks. Damien ZmPJVSl.png
Intermediate & Advanced SEO | | james.harris0 -
Why is my client's site not ranking anymore? Like big time!
Ok, I'm reaching out to all of you Moz'rs for some help with this one. My client's site has dropped off the face of google in a real short period of time. It went from page 1 (avg rank 3 to page 6 (avg rank 50) and below in the matter of 2 weeks. Here's some facts: 1. DA is a 22 and homepage PA is a 31. It outranks all other sites in its competitive set. 2. The homepage used to be the page that displays for keyword searches, now its the FAQ page, which has a lower PA of 23. Why has the home page seemingly vaporized? And, why is the FAQ showing as the first result? What should I start checking. I feel paralyzed, not sure where to start. More info: a. There are no alerts present in Webmaster Tools. b. For some reason the homepage (domain.com) was 301'd to domain.com/home.html. Domain.com is indexed by Google, however, domain.com/home.html is not. If this is the issue, what is the best way to handle it? Thanks in advance for your help!
Intermediate & Advanced SEO | | rhoadesjohn1 -
Need help on SEO for my site. Can't figure out what is wrong.
My site, findyogi.com, isn't ranking well in google SERPs. For some good content and matching keyword, my pages are ranking 200+ whereas other sites that have similar or lower authority are ranking in top 10. I must be doing something fundamentally wrong but can't seem to figure out what. I am not looking at ranking 1 on google right now but my pages don't appear even on page 2-4. Sample Keyword- "Samsung galaxy s4 price in india" . Matching page - www.findyogi.com/mobiles/samsung/samsung-galaxy-s4-b94a37/price Please help.
Intermediate & Advanced SEO | | namansr0 -
Google suddenly indexing and displaying URLs that haven't existed for years?
We recently noticed google is showing approx 23,000 indexed .jsp urls for our site. These are ancient pages that haven't existed in years and have long been 301 redirected to valid urls. I'm talking 6 years. Checking the serps the other day (and our current SEOMoz pro campaign), I see that a few of these urls are now replacing our correct ones in the serps for important, competitive phrases. What the heck is going on here? Is Google suddenly ignoring rewrite rules and redirects? Here's an example of the rewrite rules that we've used for 6+ years: RewriteRule ^(.*)/xref_interlux_antifoulingoutboards&keels.jsp$ $1/userportal/search_subCategory.do?categoryName=Bottom%20Paint&categoryId=35&refine=1&page=GRID [R=301] Now, this 'bottom paint' url has been incredibly stable in the serps for over a half decade. All of a sudden, a google search for 'bottom paint' (no quotes) brings up the jsp page at position 2-3. This is just one example of something very bizarre happening. Has anyone else had something similar happen lately? Thank You <colgroup><col width="64"></colgroup>
Intermediate & Advanced SEO | | jamestown
| RewriteRule ^(.*)/xref_interlux_antifoulingoutboards&keels.jsp$ $1/userportal/search_subCategory.do?categoryName=Bottom%20Paint&categoryId=35&refine=1&page=GRID [R=301] |0 -
How can someone not call B.S. on this site ranking 4th.
We manage a lot of sites that are around pharmaceuticals and lawsuits. I was checking a couple of the sites around the keyword: Actos Lawsuit using the keyword difficulty with serp analysis. Our sites have done very little Adwords except for first month about a year ago and we have always ranked well and the client is very happy with the results. Tonight I notice a site that is http://wikilawsuit.org/drug-recalls/actos-side-effects-bladder-cancer-actos-lawsuit/ They are ranked fourth on Google. Our url which is http://actos-lawsuit.org/ is ranked 9th?? Frankly there are several sites ranked ahead and when you look at the parameters all the way across some we are killing. But Wiki, everyone is killing and it is still fourth. I ran it in OSE and the metrics came back better, but there is at best 3 to 4 real links out of 30 domains. This is a commercial site with a contact form in right sidebar and my guess is they are selling leads to lawyers. So they are about as Wiki as Hooters. That said, we see all the talk about quality links and I am seeing a lot of sites with few quality links and lots of junk links. Should we still believe it matters? Or, is it that it matters when the sites are huge (JC Penny), etc. but not if the site is under some critical number of poor links? Looking forward to a moz Fest on this.
Intermediate & Advanced SEO | | RobertFisher0