Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one of the issues I've spotted is the vast difference between the number of pages crawled by Googlebot according to our log files and the number of pages indexed according to GWT. Consider:
- Number of pages crawled per log files: 2993
- Crawl frequency (i.e. number of times those pages were crawled): 61438
- Number of pages indexed per GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked from the main sitemap.xml file; these sitemaps have been crawled fairly frequently, and I think this is where a lot of the indexed URLs have come from. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
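For anyone who wants to cross-check the two log-file numbers above without Web Log Explorer, here is a minimal sketch of how "pages crawled" (unique URLs) and "crawl frequency" (total hits) could be counted. It assumes the server writes Apache/Nginx Combined Log Format; the names `LOG_LINE` and `googlebot_crawl_stats` and the sample lines are made up for illustration. It also matches on the user-agent string only, which can be spoofed, so treat the result as an upper bound on genuine Googlebot activity.

```python
import re
from collections import Counter

# Assumed format (Combined Log Format):
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_crawl_stats(lines):
    """Return (unique URLs requested, total requests) for Googlebot user agents."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not parse, or that come from other agents, are skipped.
        if match and "Googlebot" in match.group("agent"):
            # The path includes any query string, so /p?a=1 and /p?a=2 count separately.
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values())


# Hypothetical sample lines, not taken from the site discussed in this thread.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2014:13:55:36 -0700] "GET /page-a HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:13:56:01 -0700] "GET /page-a HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:13:57:12 -0700] "GET /page-b HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Oct/2014:13:58:40 -0700] "GET /page-c HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

unique_pages, total_hits = googlebot_crawl_stats(sample)
print(unique_pages, total_hits)
```

To run it against a real file, pass an open file handle instead of `sample`, for example `googlebot_crawl_stats(open("access.log"))`, where the file name is again only a placeholder.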
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know: a URI identifies any given resource on the Web, while "URL" is often used loosely to mean the address of a complete Web page (strictly speaking, every URL is also a URI). An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.
So where I'm headed in my thinking is major duplicate content issues... but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
- Number indexed: 17,182,818
- Difference: 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s is 85% of the difference between the number indexed and the number submitted doesn't look like a coincidence.
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
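For readers who want to check the arithmetic in this post, here is a short sketch that recomputes the figures quoted above. The variable names are made up for illustration; the numbers are the ones reported from GWT. Note that the 85% figure compares the 404 count with the gap between total indexed and total submitted; the last two lines show the alternative gap measured against sitemap URLs actually indexed, which is an extra comparison and not one made in the post.

```python
# Figures as quoted from GWT in the post above.
submitted = 13_882_040              # URLs submitted in XML sitemaps
indexed_from_sitemaps = 13_204_476  # submitted URLs reported as indexed
total_indexed = 17_182_818          # total pages reported as indexed
not_found = 2_810_650               # URLs returning 404

sitemap_index_rate = indexed_from_sitemaps / submitted
gap = total_indexed - submitted
share_404 = not_found / gap

print(f"Sitemap URLs indexed: {sitemap_index_rate:.1%}")   # 95.1%
print(f"Indexed minus submitted: {gap:,}")                 # 3,300,778
print(f"404s as a share of that gap: {share_404:.0%}")     # 85%

# Alternative comparison (an assumption, not from the post):
# indexed URLs that cannot be accounted for by indexed sitemap URLs.
gap_vs_indexed_sitemap = total_indexed - indexed_from_sitemaps
print(f"Indexed minus indexed-from-sitemaps: {gap_vs_indexed_sitemap:,}")  # 3,978,342
```

A matching ratio does not show that the 404 URLs are the same URLs that make up the gap; confirming that would need a URL-level comparison between the 404 report and the sitemaps.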
-
Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?