What sources can we use to compile as comprehensive a list as possible of pages indexed in Google?
-
As part of a Panda recovery initiative, we are trying to get as comprehensive a list as possible of the URLs currently indexed by Google.
Using the site:domain.com operator, Google reports that approximately 21k pages are indexed. Scraping the results, however, ends after 240 links are listed.
Are there any other sources we could use to make the list more comprehensive? To be clear, we are not looking for external crawlers like the SEOmoz crawl tool, but for sources that would let us confidently determine which URLs are currently held in the Google index.
Thank you /Thomas
-
We don't usually take private info in public questions, but if you want to, Private Message me the domain (via my profile). I'm really curious about (1) and I'd love to take a peek.
-
Thanks Pete,
As always, I very much appreciate your input.
1/ We aren't using any parameters, and with filter=0 we get the same results. In a test I just ran, I was only able to pull 350 of 18.5k pages using the web interface. If anyone has any other thoughts on this, please let me know.
2/ That is a great idea. Most of our pages live in the root directory to keep the URL slugs short, so unfortunately this one will not help us.
3/ Another good idea. I understand this approach helps you see how many of your wanted pages are in the Google index, but, unless I misunderstood you, it won't help determine which superfluous pages are currently indexed.
4/ We are using Screaming Frog and I agree it's a fantastic tool. The Screaming Frog crawl shows no more than 300 pages, which is our final goal for the index size.
Overall we are seeing continuous yet small drops to the index size using our approach of returning 410 response codes for unwanted pages and dedicated sitemaps to speed up delisting. See http://www.seomoz.org/q/panda-recovery-what-is-the-best-way-to-shrink-your-index-and-make-google-aware
We are just trying to get a more complete list of what's currently in the index to speed up delisting.
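A minimal sketch of how a partial list of indexed URLs could feed a dedicated delisting sitemap: take the set difference against the pages that should stay, then write the remainder as a sitemap. The function names and example URLs are hypothetical, and the sketch assumes both lists are already available as plain strings.

```python
from xml.sax.saxutils import escape


def unwanted_urls(indexed, wanted):
    """Return indexed URLs that are not in the wanted set, sorted."""
    return sorted(set(indexed) - set(wanted))


def build_sitemap(urls):
    """Build a minimal XML sitemap string for the given URLs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        # escape() turns "&" into "&amp;" so parameterized URLs stay valid XML
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines)


# Hypothetical example data: URLs found in the index vs. pages to keep
indexed = [
    "http://example.com/",
    "http://example.com/old-page",
    "http://example.com/tag?a=1&b=2",
]
wanted = ["http://example.com/"]

removal_sitemap = build_sitemap(unwanted_urls(indexed, wanted))
```

The resulting string could be saved as a separate sitemap file listing only the URLs that now return 410.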
Thank you for the reference to the Panda post. I remember reading it before and will give it another go right now.
One final question: in your experience dealing with Panda penalties, have you seen scenarios where the delisting/penalizing of a site seems to have happened only for a particular Google ccTLD, or only for the homepage? See http://www.seomoz.org/q/panda-penguin-penalty-not-global-but-only-firea-for-specific-google-cctlds This is what we are currently experiencing, and we are trying to see whether other people have observed something similar.
Best /Thomas
-
If you're willing to piece together multiple sources, I can definitely give you some starting points:
(1) First, dropping from 21K pages indexed in Google to 240 definitely seems odd. Are you hitting omitted results? You may have to shut off filtering in the URL (&filter=0).
(2) You can also divide the site up logically and run "site:" on sub-folders, parameters, etc. Say, for example:
site:example.com/blog
site:example.com/shop
site:example.com/uk
As long as there's some logical structure, you can use it to break the index request down into smaller chunks. Don't forget to use inurl: for URL parameters (filters, pagination, etc.).
(3) This takes a while, but split up your XML sitemaps into logical clusters - say, one for major pages, one for top-level topics/categories, one for sub-categories, one for products. That way, you'll get a cleaner count of what kind of pages are indexed, and you'll know where your gaps are.
(4) Run a desktop crawler on the site, like Xenu or Screaming Frog (Xenu is free, but PC only and harder to use. Screaming Frog has a yearly fee, but it's an excellent tool). This won't necessarily tell you what Google has indexed, but it will help you see how your site is being crawled and where problems are occurring.
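As a rough illustration of (1) and (2), the chunked queries can be generated rather than typed by hand. This sketch only builds query strings and search URLs; it does not fetch anything. The helper names, the example folders, and the example parameter are assumptions for illustration.

```python
from urllib.parse import urlencode


def site_queries(domain, folders=(), params=()):
    """Build site: queries for sub-folders and inurl: queries for parameters."""
    queries = [f"site:{domain}/{folder.strip('/')}" for folder in folders]
    queries += [f"site:{domain} inurl:{param}" for param in params]
    return queries


def search_url(query):
    """Build a Google search URL with omitted-result filtering off (filter=0)."""
    return "https://www.google.com/search?" + urlencode({"q": query, "filter": "0"})


# Hypothetical structure: three sub-folders and one pagination parameter
queries = site_queries("example.com", folders=["blog", "shop", "uk"], params=["page"])
urls = [search_url(q) for q in queries]
```

Each chunk stays under the per-query result limit more easily than a single site: query for the whole domain.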
I wrote a mega-post a while back on all the different kinds of duplicate content. Sometimes, just seeing examples can help you catch a problem you might be having. It's at:
http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world
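The sitemap split suggested in (3) could start from a simple grouping step. This is a sketch under one assumption that is not from the thread: that the first path segment is a usable cluster key, with single-segment pages collected under a hypothetical "root" bucket. Sites with flat URL structures would need a different key, such as page type.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_urls(urls):
    """Group URLs by first path segment; top-level pages go to 'root'."""
    clusters = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # Only URLs nested at least one level deep get their own cluster
        key = segments[0] if len(segments) > 1 else "root"
        clusters[key].append(url)
    return dict(clusters)


# Hypothetical example URLs
clusters = cluster_urls([
    "http://example.com/",
    "http://example.com/about",
    "http://example.com/blog/post-1",
    "http://example.com/shop/pens",
])
```

Each cluster can then be written to its own sitemap file, so the indexed count reported per sitemap maps to one kind of page.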
-
Does anyone have any insight on this? If the answer is simply that there is no better approach than looking at the limited data available through the Google UI, that would be helpful to know as well.