Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one of the issues I've spotted is the vast difference between the number of pages crawled by the Googlebot according to our log files versus the number of pages indexed in GWT. Consider:
- Number of pages crawled per log files: 2993
- Crawl frequency (i.e. number of times those pages were crawled): 61438
- Number of pages indexed by GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know, URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page. An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have it's very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM roughly show 6-7 million pages indexed in Google. Walmart has between 8-9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
-
Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
-
Number indexed: 17,182,818
-
Difference: 3,300,778
-
Number of URLs throwing 404 errors: 2,810,650
-
2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on site (I mentioned them in a separate post here) are at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s is 85% of the difference between the number indexed and submitted is not exactly a coincidence.
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
-
Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow up question: Does your site really have 17+ million pages?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
An informational product page AND a shop page (for same brand)
Hi all, This is my first foray into e-commerce SEO. I'm working with a new client who sells upscale eBikes online. Since his products are expensive, he wants to have informational pages about the brands he sells eg. www.example.com/brand. However these brands are also category pages for his online shop eg. www.example.com/shop/brand I'm worried about keyword cannibalization and adding an extra step/click to get to the shop (right now the navigational menu takes you to the information page and from there you have to click to get to the shop) I'm pretty sure it would make more sense to have ONE killer shopping page that includes all the brand information but I want to be 100% sure before I advise him to take this big step. Thoughts?
Technical SEO | | MouthyPR1 -
Purchased domain with links - redirect page by page or entire domain?
Hi, I purchased an old domain with a lot of links that I'm redirecting to my site. I want all of their links to redirect to the same page on my site so I can approach this two different ways: Entire site
Technical SEO | | ninel_P
1.) RedirectMatch 301 ^(.*)$ http://www.xyz.com or Page by page
2). Redirect 301 /retiredpage.html http://www.xyz.com/newpage.html Is there a better option I should go with in regards to SEO effectiveness? Thanks in advance!0 -
Is it easier to rank high with a front page than a landing page?
My product is laptop and of cause, I like to rank high for the keyword "laptop". Do any of you know if the search engines tends to rank a front page higher than a landing page? Eg. www.brand.com vs. www.brand.com/laptop
Technical SEO | | Debitoor0 -
Pages Indexed Not Changing
I have several sites that I do SEO for that are having a common problem. I have submitted xml sitemaps to Google for each site, and as new pages are added to the site, they are added to the xml sitemap. To make sure new pages are being indexed, I check the number of pages that have been indexed vs. the number of pages submitted by the xml sitemap every week. For weeks now, the number of pages submitted has increased, but the number of pages actually indexed has not changed. I have done searches on Google for the new pages and they are always added to the index, but the number of indexed pages is still not changing. My initial thought was as new pages are added to the index, old ones are being dropped. But I can't find evidence of that, or understand why that would be the case. Any ideas on why this is happening? Or am I worrying about something that I shouldn't even be concerned with since new pages are being indexed?
Technical SEO | | ang1 -
Joomla creating duplicate pages, then the duplicate page's canonical points to itself - help!
Using Joomla, every time I create an article a subsequent duplicate page is create, such as: /latest-news/218-image-stabilization-task-used-to-develop-robot-brain-interface and /component/content/article?id=218:image-stabilization-task-used-to-develop-robot-brain-interface The latter being the duplicate. This wouldn't be too much of a problem, but the canonical tag on the duplicate is pointing to itself.. creating mayhem in Moz and Webmaster tools. We have hundreds of duplicates across our website and I'm very concerned with the impact this is having on our SEO! I've tried plugins such as sh404SEF and Styleware extensions, however to no avail. Can anyone help or know of any plugins to fix the canonicals?
Technical SEO | | JamesPearce0 -
Post vs page in Wordpress?
Hello there, I have a Wordpress site and would like to know if it is better to have 600 posts or 600 pages in terms of efficiency in the site. I would like to publish the content as pages, as I can have subapges,etc... and keep the path: www.website.com/page/subpage1... in terms of good SEO. This structure of using pages rahter than posts allow me to keep the path as stated above (with a category/post path I could not manage in this sense as a pile of articles is displayed although the path category/post in terms of SEO I understand would be good too). Thank you very much for your thoughts here as I would go for a page structure. Antonio
Technical SEO | | aalcocer20030 -
Different links to to the same page
Hi, Based on the user's actions we post activity into users Facebook timeline. And each activity has link back to our particular page on our website. For example if original page was: www.Domain.com from Facebook timeline it would be like this: www.Domain.com?Ffb_action_ids=101508953168 Do you think this will have a negative effect on our page rankings as we will eded up having a lot of different URL's to the same page? www.Domain.com?Ffb_action_ids=101508953168 www.Domain.com?Ffb_action_ids=456788765609 etc.. Thank you, Karen Bdoyan
Technical SEO | | showme0 -
Duplicate Page Titles
I had an issue where I was getting duplicate page titles for my index file. The following URLs were being viewed as duplicates: www.calusacrossinganimalhospital.com www.calusacrossinganimalhospital.com/index.html www.calusacrossinganimalhospital.com/ I tried many solutions, and came across the rel="canonical". So i placed the the following in my index.html: I did a crawl, and it seemed to correct the duplicate content. Now I have a new message, and just want to verify if this is bad for search engines, or if it is normal. Please view the attached image. i9G89.png
Technical SEO | | pixel830