Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one issue I've spotted is the vast difference between the number of pages crawled by Googlebot according to our log files and the number of pages indexed according to GWT. Consider:
- Number of pages crawled per log files: 2,993
- Crawl frequency (i.e. total number of times those pages were crawled): 61,438
- Number of pages indexed per GWT: 17,182,818 (yes, that's right - more than 17 million pages)
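For anyone who wants to reproduce the first two numbers without Web Log Explorer, here is a rough Python sketch. It assumes the logs are in the Apache/Nginx "combined" format and identifies Googlebot by user-agent string alone (which can be spoofed); the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache/Nginx "combined" log format. Adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_crawl_stats(lines):
    """Return (unique pages crawled, total crawl hits) for Googlebot requests."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # User-agent check only; a stricter version would verify the IP via reverse DNS.
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values())


# Hypothetical sample lines, not taken from the real logs.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2013:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Oct/2013:13:56:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Oct/2013:13:57:12 +0000] "GET /b HTTP/1.1" 404 128 "-" "{GOOGLEBOT}"',
    '203.0.113.9 - - [10/Oct/2013:13:58:40 +0000] "GET /c HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]

unique_pages, total_hits = googlebot_crawl_stats(sample)
print(f"Pages crawled: {unique_pages}, crawl frequency: {total_hits}")
```

The first value corresponds to "pages crawled" and the second to "crawl frequency" in the list above.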
We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
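One way to sanity-check the sitemaps themselves is to count how many URLs they list and how many of those are unique. The sketch below is only an illustration: it assumes standard sitemaps.org XML, parses strings rather than fetching anything, and the helper names and example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard sitemaps.org namespace used by both sitemap index files and sitemaps.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml):
    """Return the child sitemap locations listed in a sitemap index document."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]


def page_urls(sitemap_xml):
    """Return every <loc> listed in a single sitemap, duplicates included."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


# Hypothetical documents for illustration only.
index_doc = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

sitemap_doc = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/page-1</loc></url>
  <url><loc>http://www.example.com/page-2</loc></url>
  <url><loc>http://www.example.com/page-1</loc></url>
</urlset>"""

urls = page_urls(sitemap_doc)
print(f"Child sitemaps: {len(child_sitemaps(index_doc))}")
print(f"URLs listed: {len(urls)}, unique: {len(set(urls))}")
```

Running the same count across all of the child sitemaps would show whether the submitted total is inflated by repeated URLs.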
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know, URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page. An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
- Number indexed: 17,182,818
- Difference: 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 2,810,650 / 3,300,778 = 85%
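The arithmetic above can be rechecked with a few lines of Python. The figures are the ones quoted in this post; the variable names are just labels chosen for the sketch.

```python
# Figures as reported in GWT (from the post above).
submitted = 13_882_040              # URLs submitted in XML sitemaps
indexed_from_sitemaps = 13_204_476  # submitted URLs reported as indexed
indexed_total = 17_182_818          # total pages reported as indexed
not_found = 2_810_650               # URLs throwing 404 errors

# Indexed pages that are not accounted for by the sitemap submissions.
difference = indexed_total - submitted

sitemap_index_rate = indexed_from_sitemaps / submitted
not_found_share = not_found / difference

print(f"Difference: {difference:,}")
print(f"Share of submitted URLs indexed: {sitemap_index_rate:.1%}")
print(f"404s as a share of the difference: {not_found_share:.0%}")
```

Note that this only shows the two counts are of similar size; it does not show that the 404 URLs are the same URLs that make up the difference.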
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s equals 85% of the difference between the number indexed and the number submitted does not look like a coincidence.
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
-
Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?