Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one issue stands out: the vast difference between the number of pages crawled by Googlebot according to our log files and the number of pages indexed according to Google Webmaster Tools (GWT). Consider:
- Number of pages crawled per log files: 2,993
- Crawl frequency (i.e. number of times those pages were crawled): 61,438
- Number of pages indexed per GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked from the main sitemap.xml file; these sitemaps have been crawled fairly frequently, and I suspect this is how a lot of the URLs got indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
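For anyone who wants to reproduce the two log-file numbers without Web Log Explorer, here is a minimal sketch of the counting involved. It assumes the server writes the common Apache/Nginx "combined" log format and that matching "Googlebot" in the user-agent string is good enough for a first pass (user agents can be spoofed, so this is not a verification of real Googlebot traffic). The function name and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumption: Apache/Nginx "combined" log format. Adjust the pattern if
# your server logs in a different layout.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_crawl_stats(lines):
    """Return (unique_urls, total_hits, hits_per_url) for lines whose
    user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values()), hits

# Hypothetical sample lines, for illustration only.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_lines = [
    f'66.249.66.1 - - [10/Oct/2013:13:55:36 -0700] "GET /a HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2013:13:56:01 -0700] "GET /a HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2013:13:57:12 -0700] "GET /b HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Oct/2013:13:58:00 -0700] "GET /c HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
]

unique_urls, total_hits, per_url = googlebot_crawl_stats(sample_lines)
print(f"pages crawled: {unique_urls}, crawl frequency: {total_hits}")
```

"Pages crawled" here is the count of distinct requested paths, and "crawl frequency" is the total number of Googlebot requests, which is how the 2,993 versus 61,438 split above is being read.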
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know: a URI is the unique Web address of any given resource, while "URL" is generally used to refer to the address of a complete Web page (strictly speaking, every URL is also a URI). An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
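To make the duplicate-content idea concrete, here is a rough sketch of how one might group URLs that probably point at the same page. Everything in it is an assumption for illustration: the list of ignored parameters, the decision to treat a trailing slash as insignificant, and the function names are all hypothetical, and none of it says how Google itself canonicalizes URLs.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that do not change page content on this site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def normalize(url):
    """Reduce a URL to a comparison key: lowercase scheme and host, no
    fragment, no ignored parameters, sorted query, no trailing slash."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

def duplicate_groups(urls):
    """Map each normalized key to its original URLs, keeping only keys
    that more than one URL collapses to."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Hypothetical URLs, for illustration only.
example_urls = [
    "http://Example.com/widgets/?utm_source=x",
    "http://example.com/widgets",
    "http://example.com/widgets?color=red&size=m",
    "http://example.com/widgets?size=m&color=red",
    "http://example.com/other",
]

for key, members in duplicate_groups(example_urls).items():
    print(key, "<-", members)
```

Run against the URLs in the sitemaps or the logs, a large number of multi-member groups would be a hint that the indexed count is inflated by variants of the same pages.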
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
- Number indexed: 17,182,818
- Difference: 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s is 85% of the difference between the number indexed and the number submitted doesn't look like a coincidence to me.
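For anyone checking the arithmetic, the figures above can be recomputed as below. The variable names are mine; the numbers are the ones quoted from GWT in this post.

```python
# Figures quoted from GWT above.
submitted = 13_882_040              # URLs submitted in XML sitemaps
indexed_from_sitemaps = 13_204_476  # submitted URLs reported as indexed
indexed_total = 17_182_818          # total pages reported as indexed
not_found = 2_810_650               # URLs returning 404

# Indexed pages that were never submitted in a sitemap.
difference = indexed_total - submitted

# Share of submitted URLs that are indexed.
sitemap_index_rate = indexed_from_sitemaps / submitted

# 404 count expressed as a share of the indexed-minus-submitted gap.
share_404 = not_found / difference

print(f"difference: {difference:,}")
print(f"sitemap index rate: {sitemap_index_rate:.1%}")
print(f"404s as share of difference: {share_404:.1%}")
```

Note that the ratio only shows the two numbers are similar in size. It doesn't show that the 404 URLs are the same URLs that make up the gap, which would need a URL-by-URL comparison.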
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
-
Hi. Interesting question. You had me at "log files."
So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?