What Sources to use to compile an as comprehensive list of pages indexed in Google?
-
As part of a Panda recovery initiative we are trying to get an as comprehensive list of currently URLs indexed by Google as possible.
Using the site:domain.com operator Google displays that approximately 21k pages are indexed. Scraping the results however ends after the listing of 240 links.
Are there any other sources we could be using to make the list more comprehensive? To be clear, we are not looking for external crawlers like the SEOmoz crawl tool but sources that would be confidently allow us to determine a list of URLs currently hold in the Google index.
Thank you /Thomas
-
We don't usually take private info in public questions, but if you want to, Private Message me the domain (via my profile). I'm really curious about (1) and I'd love to take a peek.
-
Thanks Pete,
As always very much appreciate your input.
1/ We aren't using any parameters and when using the filter=0 we are getting the same results. For my just done test I was only able to pull 350 pages out of 18.5k pages using the web interface. If anyone has any other thoughts on this please let me now.
2/ That is a great idea. Most of our pages live in the root directory to keep the URL slugs short so unfortunately this one will not help us.
3/ Another good idea. I understand this approach is helpful to see your coverage of wanted pages in the Google index but won't be able to help you determine superfluous pages currently in the Google index unless I misunderstood you?
4/ We are using ScreamingFrog and I agree its a fantastic tool. The index size with ScreamingFrog is showing not more than 300 pages which is our final goal.
Overall we are seeing continuous yet small drops to the index size using our approach of returning 410 response codes for unwanted pages and dedicated sitemaps to speed up delisting. See http://www.seomoz.org/q/panda-recovery-what-is-the-best-way-to-shrink-your-index-and-make-google-aware
We are just trying to get a more complete list of whats currently in the index to speed up delisting.
Thank you for your reference to the Panda post I remember reading it before and will give it another go right now.
One final question, in your experience dealing with Panda penalties, have you seen scenarios where it seems the delisting/penalizing of a site has only happened for a particular CCTLD of google or just the homepage? See http://www.seomoz.org/q/panda-penguin-penalty-not-global-but-only-firea-for-specific-google-cctlds It is what we are currently experiencing and trying to see if other people have observed something similar.
Best /Thomas
-
If you're willing to piece together multiple sources, I can definitely give you some starting points:
(1) First, dropping from 21K pages indexed in Google to 240 definitely seems odd. Are you hitting omitted results? You may have to shut off filtering in the URL (&filter=0).
(2) You can also divide the site up logically and run "site:" on sub-folders, parameters, etc. Say, for example:
site:example.com/blog
site:example.com/shop
site:example.com/uk
As long as there's some logical structure, you can use it to break the index request down into smaller chunks. Don't forget to use inurl: for URL parameters (filters, pagination, etc.).
(3) This takes a while, but split up your XML sitemaps into logical clusters - say, one for major pages, one for top-level topics/categories, one for sub-categories, one for products. That way, you'll get a cleaner could of what kind of pages are indexed, and you'll know where your gaps are.
(4) Run a desktop crawler on the site, like Xenu or Screaming Frog (Xenu is free, but PC only and harder to use. Screaming Frog has a yearly fee, but it's an excellent tool). This won't necessarily tell you what Google has indexed, but it will help you see how your site is being crawled and where problems are occurring.
I wrote a mega-post a while back on all the different kinds of duplicate content. Sometimes, just seeing examples can help you catch a problem you might be having. It's at:
http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world
-
Does anyone have any insight on this? If the answer is simply there is no better approach than look at the limited data available through the Google UI this would be helpful as well.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Should I worry about rendering problems of my pages in google search console fetch as google?
Some elements are not properly shown when I preview our pages in search console (fetch as google), e.g.
Intermediate & Advanced SEO | | lcourse
google maps, css tables etc. and some parts are not showing up since we load them asynchroneously for best page speed. Is this something should pay attention to and try to fix?0 -
How do you check the google cache for hashbang pages?
So we use http://webcache.googleusercontent.com/search?q=cache:x.com/#!/hashbangpage to check what googlebot has cached but when we try to use this method for hashbang pages, we get the x.com's cache... not x.com/#!/hashbangpage That actually makes sense because the hashbang is part of the homepage in that case so I get why the cache returns back the homepage. My question is - how can you actually look up the cache for hashbang page?
Intermediate & Advanced SEO | | navidash0 -
Google indexed wrong pages of my website.
When I google site:www.ayurjeewan.com, after 8 pages, google shows Slider and shop pages. Which I don't want to be indexed. How can I get rid of these pages?
Intermediate & Advanced SEO | | bondhoward0 -
Using Canonical URL to poin to an external page
I was wondering if I can use a canonical URL that points to a page residing on external site? So a page like:
Intermediate & Advanced SEO | | llamb
www.site1.com/whatever.html will have a canonical link in its header to www.site2.com/whatever.html. Thanks.0 -
Google Processing but Not Indexing XML Sitemap
Like it says above, Google is processing but not indexing our latest XML sitemap. I noticed this Monday afternoon - Indexed status was still Pending - and didn't think anything of it. But when it still said Pending on Tuesday, it seemed strange. I deleted and resubmitted our XML sitemap on Tuesday. It now shows that it was processed on Tuesday, but the Indexed status is still Pending. I've never seen this much of a lag, hence the concern. Our site IS indexed in Google - it shows up with a site:xxxx.com search with the same number of pages as it always has. The only thing I can see that triggered this is Sunday the site failed verification via Google, but we quickly fixed that and re-verified via WMT Monday morning. Anyone know what's going on?
Intermediate & Advanced SEO | | Kingof50 -
New Web Page Not Indexed
Quick question with probably a straightforward answer... We created a new page on our site 4 days ago, it was in fact a mini-site page though I don't think that makes a difference... To date, the page is not indexed and when I use 'Fetch as Google' in WT I get a 'Not Found' fetch status... I have also used the'Submit URL' in WT which seemed to work ok... We have even resorted to 'pinging' using Pinglar and Ping-O-Matic though we have done this cautiously! I know social media is probably the answer but we have been trying to hold back on that tactic as the page relates to a product that hasn't quite launched yet and we do not want to cause any issues with the vendor! That said, I think we might have to look at sharing the page socially unless anyone has any other ideas? Many thanks Andy
Intermediate & Advanced SEO | | TomKing0 -
Google Listings
How can i make my pages appear in google results such as menu, diner, hours, contact us etc.. when some searches for my keyword or domain take a look at this screen shot Thanks UbqY4kwA UbqY4kwA
Intermediate & Advanced SEO | | vlad_mezoz0 -
Why does Google Claimed Local Listing Ranking Drop?
I have two local google places listinggs unlaimed. Both listings were ranking in the blended search in 7 pack. Once I claimed the local listings for the business both listings rankings have dropped. And one has totally vanished from the search rankings. Is this normal as it appears local places that are not claimed are ranking higher than local places claimed?
Intermediate & Advanced SEO | | VivaArturo0