Working out exactly how Google is crawling my site if I have loooots of pages
-
I am trying to work out exactly how Googlebot is crawling my site, including its entry points and the path it follows from there. The site has millions of pages, with hundreds of thousands of them indexed. I have simple log files with a timestamp and the URL Googlebot requested. Unfortunately there are hundreds of thousands of entries even for a single day, and because the site is so large I am finding it hard to trace the spider's paths. Is there any way, using the log files and Excel or other tools, to work this out simply?

Also, I was expecting the bot to go through each level almost instantaneously, e.g. main page --> category page --> subcategory page (i.e. with essentially the same timestamp), but this does not appear to be the case. Does the bot follow a path right down to the deepest level it can reach (or is allowed to reach) for that crawl, and then return to the higher-level category pages at a later time? Any help would be appreciated.
Cheers
-
Can you explain how you did your sitemap for this, please?
-
I've run into the same issue on a site with 40k+ pages - far from your overall page count, but the flow is probably similar.
The site I was working on had a structure about 5 levels deep. Some of the areas within the last level were out of reach and didn't get indexed. More than that, even a few areas on level 2 were not present in the Google index, and Googlebot didn't visit those either.
I created a large XML sitemap and a dynamic HTML sitemap with all the pages on the site, and submitted the XML sitemap via Webmaster Tools, but that didn't solve the issue - the same areas stayed out of the index and didn't get hit. The huge HTML sitemap was impossible to navigate from a user's point of view anyway, so I didn't keep it online for long, but I'm sure it can't work that way either.
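(For reference, the XML sitemap entries were just the standard sitemaps.org format - the URL below is a placeholder, not one of the actual pages:)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/level2/some-module/</loc>
    <lastmod>2012-05-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <!-- one <url> entry per page, up to 50,000 URLs per sitemap file -->
</urlset>
```

Note that a single sitemap file is capped at 50,000 URLs, so for a site with millions of pages you'd need multiple sitemap files tied together with a sitemap index file.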
What finally solved the issue was to spot the exact areas that were left out and identify the "head" of those areas - meaning the several pages that acted as gateways for each entire module. I then built a few outside links pointing directly to those gateway pages, and a few more pointing to the main internal pages of the modules that were left out.
Those pages gained authority fast, and within only a few days we spotted Googlebot staying overnight.
All the pages are now indexed and even ranking well.
If you can spot some entry pages that can lead the spider to the rest, you can try this approach - it should work for you too.
As for the links, I started with social network links, a few posts with links on the site's blog (so internal links), and only a couple of outside links - articles with in-content links to those pages. Overall I think we're talking about 20-25 social network links (Twitter, Facebook, Digg, StumbleUpon, and Delicious), about 10 blog posts published over a 2-3 day span, and about 10 articles in outside sources.
Since you have a much larger page count, you'll probably need more gateways, and that means more links - but overall it's not a very time-consuming job, and it can solve your issue... hopefully.
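Back to the original log-file question: before reaching for Excel, a short script can condense hundreds of thousands of log lines into ordered per-visit paths. Here's a minimal sketch in Python, assuming a hypothetical tab-separated timestamp-and-URL line format (adjust the parsing to match your actual logs):

```python
from datetime import datetime, timedelta

# Hypothetical log format, one Googlebot hit per line:
#   "2012-05-01T10:00:00\t/category/widgets/"
# Adjust the split/parse below to match your real log layout.
GAP = timedelta(minutes=30)  # treat a 30-minute silence as the end of one crawl "session"

def crawl_paths(lines):
    """Group consecutive Googlebot hits into ordered paths, splitting on long gaps."""
    path, last_seen = [], None
    for line in lines:
        stamp, url = line.rstrip("\n").split("\t", 1)
        ts = datetime.fromisoformat(stamp)
        # A long gap means the bot finished one pass; start a new path.
        if last_seen is not None and ts - last_seen > GAP:
            yield path
            path = []
        path.append((ts, url))
        last_seen = ts
    if path:
        yield path
```

Run it over a day's worth of grepped Googlebot lines and you get ordered URL sequences instead of a flat dump, which makes it much easier to see whether the bot dives deep into subcategories in one pass or returns for the lower levels later. At hundreds of thousands of rows, Excel will struggle long before a script like this does.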