Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one issue I've spotted is the vast difference between the number of pages crawled by Googlebot according to our log files and the number of pages indexed in GWT. Consider:
- Number of pages crawled per the log files: 2,993
- Crawl frequency (i.e., number of times those pages were crawled): 61,438
- Number of pages indexed by GWT: 17,182,818 (yes, that's right: more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
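(For anyone who wants to sanity-check counts like these without Web Log Explorer, below is a minimal Python sketch that produces the same two numbers. It assumes an Apache/Nginx "combined" log format and a simple user-agent match; the filename is a placeholder, and a serious audit should also verify Googlebot by reverse DNS, since the user-agent string is easily spoofed.)

```python
from collections import Counter

googlebot_hits = Counter()

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        # In the combined log format the request line is the first quoted
        # field, e.g. "GET /some/page HTTP/1.1"
        try:
            request = line.split('"')[1]
            path = request.split()[1]
        except IndexError:
            continue  # skip malformed lines
        googlebot_hits[path] += 1

print("Unique pages crawled:", len(googlebot_hits))          # cf. 2,993
print("Total crawl events:", sum(googlebot_hits.values()))   # cf. 61,438
```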
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemaps doesn't necessarily mean there are that many pages on the site. We could be talking about URIs versus URLs.
I'm pretty sure you know what I mean by that, but for others reading this who may not: a URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page. An example would be an image. While it certainly has its own unique address on the Web, it most often does not have its own "page" on a Website (although there are certainly exceptions).
So I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google; Walmart has between 8 and 9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
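One rough way to probe that duplicate-content theory while you wait is to normalize every URL in the sitemaps and count how many distinct URLs collapse onto the same path. Here's a sketch; the input file (one URL per line, flattened from the ~350 sitemaps) and the normalization rules are both assumptions you'd tune to how the site actually builds its URLs:

```python
from collections import Counter
from urllib.parse import urlsplit

collisions = Counter()

# all_sitemap_urls.txt: one URL per line, flattened from the sitemaps
with open("all_sitemap_urls.txt", encoding="utf-8") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        # Treat URLs as duplicates when they differ only by query string,
        # fragment, trailing slash, or host casing -- adjust as needed.
        canonical = parts.netloc.lower() + parts.path.rstrip("/")
        collisions[canonical] += 1

dupes = {path: n for path, n in collisions.items() if n > 1}
print("Distinct canonical paths:", len(collisions))
print("Paths submitted under 2+ URL variants:", len(dupes))
```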
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number of URLs submitted in XML sitemaps per GWT: 13,882,040 (number indexed from the sitemaps: 13,204,476, or 95.1%)
- Total number indexed: 17,182,818
- Difference (indexed minus submitted): 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 2,810,650 / 3,300,778 ≈ 85%
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post) is at least partially to blame. How much, though? I know Google says 404s don't hurt SEO, but the fact that the 404 count works out to 85% of the gap between the number indexed and the number submitted doesn't look like a coincidence.
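One way to get from "not a coincidence" to an actual number is plain set arithmetic on the two exports. A minimal sketch, assuming the 404 list can be exported from GWT's crawl errors and the sitemaps flattened to a URL-per-line file (both filenames are placeholders):

```python
def load_urls(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

submitted = load_urls("all_sitemap_urls.txt")   # flattened sitemaps
not_found = load_urls("gwt_404_export.txt")     # 404s exported from GWT

# 404 URLs Google hit that were never in the sitemaps are candidates for
# the "extra" indexed pages it discovered on its own.
print("404 URLs outside the sitemaps:", len(not_found - submitted))

# The headline arithmetic from the post above:
indexed, in_sitemaps = 17_182_818, 13_882_040
gap = indexed - in_sitemaps  # 3,300,778
print("404s as a share of the gap: {:.0%}".format(2_810_650 / gap))  # ~85%
```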
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
Hi. Interesting question. You had me at "log files."
So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?