Working out exactly how Google is crawling my site if I have loooots of pages
-
I am trying to work out exactly how Googlebot is crawling my site, including its entry points and the path it follows from there. The site has millions of pages, with hundreds of thousands of them indexed. I have simple log files with a timestamp and the URL Googlebot requested. Unfortunately there are hundreds of thousands of entries even for a single day, and because the site is so large I am finding it hard to trace the spider's paths. Is there any way, using the log files and Excel or other tools, to work this out simply?

Also, I was expecting the bot to go through each level almost instantaneously, e.g. main page --> category page --> subcategory page (i.e. with essentially the same timestamp), but this does not appear to be the case. Does the bot follow a path right down to the deepest level it can reach (or is allowed to reach) for that crawl, and then return to the higher-level category pages at a later time? Any help would be appreciated.
Cheers
-
Can you explain how you did your sitemap for this, please?
-
I've run into the same issue on a site with 40k+ pages - far from your overall page count, but the flow is probably similar.
The site I was working on had a structure about 5 levels deep. Some of the areas within the last level were out of reach and didn't get indexed. More than that, even a few areas on level 2 were not present in the Google index, and Googlebot didn't visit those either.
I created a large XML sitemap and a dynamic HTML sitemap with all the pages on the site, and submitted the XML sitemap via Webmaster Tools, but that didn't solve the issue - the same areas stayed out of the index and didn't get hit. The huge HTML sitemap was impossible to navigate from a user's point of view anyway, so I didn't keep it online for long, but I'm sure it can't work that way either.
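(For reference, the XML sitemap entries were just the standard sitemaps.org format - the URL below is a placeholder, not one of the actual pages:)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/level2/some-module/</loc>
    <lastmod>2012-05-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <!-- one <url> entry per page, up to 50,000 URLs per sitemap file -->
</urlset>
```

Note that a single sitemap file is capped at 50,000 URLs, so for a site with millions of pages you'd need multiple sitemap files tied together with a sitemap index file.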
What finally solved the issue was to spot the exact areas that were left out and identify the "head" of those areas - meaning the several pages that acted as gateways for each entire module. I then built a few outside links pointing directly to those gateway pages, and a few more pointing to the main internal pages of the modules that were left out.
Those pages gained authority fast, and within only a few days we spotted Googlebot staying overnight.
All the pages are now indexed and even ranking well.
If you can spot some entry pages that can lead the spider to the rest, you can try this approach - it should work for you too.
As for the links, I started with social network links, a few posts with links on the site's blog (so internal links), and only a couple of outside links - articles with in-content links to those pages. Overall I think we're talking about 20-25 social network links (Twitter, Facebook, Digg, StumbleUpon, and Delicious), about 10 blog posts published over a 2-3 day span, and about 10 articles in outside sources.
Since you have a much larger page count, you'll probably need more gateways, and that means more links - but overall it's not a very time-consuming job, and it can solve your issue... hopefully.
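Back to the original log-file question: before reaching for Excel, a short script can condense hundreds of thousands of log lines into ordered per-visit paths. Here's a minimal sketch in Python, assuming a hypothetical tab-separated timestamp-and-URL line format (adjust the parsing to match your actual logs):

```python
from datetime import datetime, timedelta

# Hypothetical log format, one Googlebot hit per line:
#   "2012-05-01T10:00:00\t/category/widgets/"
# Adjust the split/parse below to match your real log layout.
GAP = timedelta(minutes=30)  # treat a 30-minute silence as the end of one crawl "session"

def crawl_paths(lines):
    """Group consecutive Googlebot hits into ordered paths, splitting on long gaps."""
    path, last_seen = [], None
    for line in lines:
        stamp, url = line.rstrip("\n").split("\t", 1)
        ts = datetime.fromisoformat(stamp)
        # A long gap means the bot finished one pass; start a new path.
        if last_seen is not None and ts - last_seen > GAP:
            yield path
            path = []
        path.append((ts, url))
        last_seen = ts
    if path:
        yield path
```

Run it over a day's worth of grepped Googlebot lines and you get ordered URL sequences instead of a flat dump, which makes it much easier to see whether the bot dives deep into subcategories in one pass or returns for the lower levels later. At hundreds of thousands of rows, Excel will struggle long before a script like this does.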