Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one of the issues I've spotted is the vast difference between the number of pages crawled by the Googlebot according to our log files versus the number of pages indexed in GWT. Consider:
- Number of pages crawled per log files: 2993
- Crawl frequency (i.e. number of times those pages were crawled): 61438
- Number of pages indexed by GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know, URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page. An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have it's very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM roughly show 6-7 million pages indexed in Google. Walmart has between 8-9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
-
Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
-
Number indexed: 17,182,818
-
Difference: 3,300,778
-
Number of URLs throwing 404 errors: 2,810,650
-
2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on site (I mentioned them in a separate post here) are at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s is 85% of the difference between the number indexed and submitted is not exactly a coincidence.
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
-
Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow up question: Does your site really have 17+ million pages?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Numbers in URL
Hey guys! Need your many awesome brains. 🙂 This may be a very basic question but am hoping you can help me out with some insights beyond "because Google says it's better". 🙂 I only recently started working with SEO, and I work for a SaaS website builder company that has millions of open/active user sites, and all our user sites URLs, instead of www.mydomainname.com/gallery or myusername.simplesite.com/about, we use numbers, so www.mysite.com/453112 or myusername.simplesite.com/426521 The Sales manager has asked me to figure out if it will pay off for us in terms of traffic (other benefits?) to change it from the number system to the "proper" and right way of setting up these URLs. He's looking for rather concrete answers, as he usually sits with paid search and is therefore used to the mindset of "if we do x it will yield us y in z months". I'm finding it quite difficult to find case studies/other concrete examples beyond the generic, vague implication that it will simply be "better" (when for example looking at SEO checklists and search engine guidelines). Will it make a difference? How so? I have to convince our developers of the importance and priority of this adjustment, or it will just drown in the many projects they already have. So truly, any insights would be so very welcome. Thank you!
Technical SEO | | michelledemaree2 -
Can Google Crawl This Page?
I'm going to have to post the page in question which i'd rather not do but I have permission from the client to do so. Question: A recruitment client of mine had their website build on a proprietary platform by a so-called recruitment specialist agency. Unfortunately the site is not performing well in the organic listings. I believe the culprit is this page and others like it: http://www.prospect-health.com/Jobs/?st=0&o3=973&s=1&o4=1215&sortdir=desc&displayinstance=Advanced Search_Site1&pagesize=50000&page=1&o1=255&sortby=CreationDate&o2=260&ij=0 Basically as soon as you deviate from the top level pages you land on pages that have database-query URLs like this one. My take on it is that Google cannot crawl these pages and is therefore having trouble picking up all of the job listings. I have taken some measures to combat this and obviously we have an xml sitemap in place but it seems the pages that Google finds via the XML feed are not performing because there is no obvious flow of 'link juice' to them. There are a number of latest jobs listed on top level pages like this one: http://www.prospect-health.com/optometry-jobs and when they are picked up they perform Ok in the SERPs, which is the biggest clue to the problem outlined above. The agency in question have an SEO department who dispute the problem and their proposed solution is to create more content and build more links (genius!). Just looking for some clarification from you guys if you don't mind?
Technical SEO | | shr1090 -
Page for page 301 redirects from old server to new server
Hi guys:
Technical SEO | | cindyt-17038
I have a client who is moving their entire ecommerce site from one hosting platform (Yahoo Store) to another (BigCommerce) and from one domain to another. The old domain is registered with the Yahoo as of yesterday and we have redirected the old domain (at the domain level) to the new domain. However, we are having trouble getting the pages to redirect page for page. Currently they are all redirecting to the new domain home page. We did just move the old domain from GoDaddy to Yahoo yesterday thinking this would solve it however as of this morning the old pages are still redirecting to the home page of the new domain. To complete the 301 redirect picture, we uploaded the redirects (all relative links for both from and to) to BigCommerce. And while the domain was hosted at GoDaddy with a redirect to the new domain, they were working. We moved the domain to Yahoo because of email issues thinking it should still work. Is it possibly just a waiting game now as the change populates across the DNS? old url to test:
rock-n-roll-action-figures.com/fender-jazz-bass-miniature-guitar-replica-classic-red-finish.html0 -
Rel canonical for partner sites - product pages only or also homepage and other key pages?
Hello there Our main site is www.arenaflowers.com. We also run a number of partner sites (eg: http://flowershop.cancerresearchuk.org/). We've relcanonical'd the products on the partner site back to the main (arenaflowers.com) site. eg: http://flowershop.cancerresearchuk.org/flowers/tutti_frutti_es_2013 rel canonicals back to: http://www.arenaflowers.com/flowers/tutti_frutti_es_2013). My question: Should we also relcanonical the homepage and other key pages on partner sites back to the main arenaflowers website too? The content is similar but not identical. We don't want our partner sites to be outranking the original (as is the case on kw flower delivery for example). (NB this situation may be complicated by the fact we appear to have an unnatural link penalty on af.com (and when we did an upgrade a while back, the af.com site fell out of the index altogether due to some issues with our move to AWS.) We're getting professional SEO advice on this but wondered what the Moz community's thoughts were.. Cheers, Will
Technical SEO | | ArenaFlowers.com0 -
Product Pages Outranking Category Pages
Hi, We are noticing an issue where some product pages are outranking our relevant category pages for certain keywords. For a made up example, a "heavy duty widgets" product page might rank for the keyword phrase Heavy Duty Widgets, instead of our Heavy Duty Widgets category page appearing in the SERPs. We've noticed this happening primarily in cases where the name of the product page contains an at least partial match for the desired keyword phrase we want the category page to rank for. However, we've also found isolated cases where the specified keyword points to a completely irrelevent pages instead of the relevant category page. Has anyone encountered a similar issue before, or have any ideas as to what may cause this to happen? Let me know if more clarification of the question is needed. Thanks!
Technical SEO | | ShawnHerrick0 -
Link rel next previous VS duplicate page title
Hello, I am running into a little problem that i would like to have a feedback on. I am running multiple wordpress blogs and Seo Moz pro is telling me that i have duplicate title tags on canadiansavers.ca vs http://www.canadiansavers.ca/page/2 I can dynamically add a page 2 to the title but I am correctly using the link rel: next and rel:previous Why is it seen as a duplicate title tag and should i add the page 2, page 3... in the meta title thanks
Technical SEO | | pixweb0 -
Why is my office page not being indexed?
Good Morning from 24 degrees C partly cloudy wetherby UK 🙂 This page is not being indexed by Google:
Technical SEO | | Nightwing
http://www.sandersonweatherall.co.uk/office-to-let-leeds/ 1st Question Ive checked robots txt file no problems, i'm in the midst of updating the xml sitemap (it had the old one in place). It only has one link from this page http://www.sandersonweatherall.co.uk/Site-Map/ So is the reason oits not being indexed just a simple case of lack if SEO juice from inbound links so the remedy lies in routing more inbound links to the offending page? 2nd question Is the quickest way to diagnose if a web address is not being indexed to cut and paste the url in the Google search box and if it doesnt return the page theres a problem? Thanks in advance, David0 -
SEOMoz is indicating I have 40 pages with duplicate content, yet it doesn't list the URL's of the pages???
When I look at the Errors and Warnings on my Campaign Overview, I have a lot of "duplicate content" errors. When I view the errors/warnings SEOMoz indicates the number of pages with duplicate content, yet when I go to view them the subsequent page says no pages were found... Any ideas are greatly welcomed! Thanks Marty K.
Technical SEO | | MartinKlausmeier0