Is there a way to get a list of Total Indexed pages from Google Webmaster Tools?
-
I'm doing a detailed analysis of how Google sees and indexes our website and we have found that there are 240,256 pages in the index which is way too many. It's an e-commerce site that needs some tidying up.
I'm working with an SEO specialist to set up URL parameters and add rules to the robots.txt file so the excess pages aren't indexed (we shouldn't have more than around 3,000 - 4,000 pages). We're struggling to find a way to get a list of these 240,256 pages, which would help us decide what to put in the robots.txt file and which URLs we should ask Google to remove.
Is there a way to get a list of the indexed URLs? We can't find it in Google Webmaster Tools.
-
Looks like I can only do the first thousand. It's a start though. Thank you for the information.
Many of the URLs on my list, when put into Google search, are giving me 80-100 other variants that I can remove by hand.
http://www.mathewporter.co.uk/list-a-domains-indexed-pages-in-google-docs/ for anyone else following.
-
Finally getting around to doing this and noticed that when I change the start number to anything above 900, it doesn't work, i.e. it's only letting me look at the first 1,000 results for some reason.
The list of 1,000 has given me some good URLs to search from for the filtering feature that was generating all the garbage URLs, but I'd love to get past 1,000 if I can.
Does anyone know how?
-
Correct. I have gone into URL Parameters already and set them to Crawl 'No URLs' for those we don't want crawled.
We haven't added the parameters listed there to the robots.txt file yet, but I will do that now. I had an initial consult today and we ran way over time when we discovered all this stuff, so I have another appointment in a couple of weeks.
We have a sitemap of all the category pages and relevant static pages on the site already and Google has those indexed nicely. We just need to get rid of the 240,000 pages it has indexed that we don't want in there (frightening I know - it's a really high number).
I greatly appreciate you taking the time to respond. Thank you.
-
Thanks. There's a lot of auto-generated content and duplicate pages, and we've set the robots.txt file up to exclude a large number of them. Now we wait.
Very helpful and greatly appreciated. Thank you.
-
Hi,
Since you've said it's an e-commerce site, I'm going to assume that the URL parameters are created by product variations, filters, sorts, etc. If so, you must already be seeing those parameters in the URLs of your site as you navigate, and in your analytics or search results.
Your SEO specialist should easily be able to add those parameters to the robots.txt file. Then personally I would resubmit a sitemap for completeness and wait for the changes to take effect.
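If it helps to see which parameters turn up most often, here is a rough Python sketch that tallies the query parameter names found in a list of URLs (for example, URLs exported from analytics). The function name and the sample URLs are made up for illustration; swap in your own list.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def count_query_parameters(urls):
    """Count how many URLs each query parameter name appears in."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # Use a set so a parameter repeated within one URL is counted once.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the real site.
    sample = [
        "https://www.example.com/shoes?colour=red&sort=price",
        "https://www.example.com/shoes?sort=name&page=2",
        "https://www.example.com/shoes",
    ]
    for name, seen in count_query_parameters(sample).most_common():
        print(name, seen)
```

The parameters at the top of that list are usually the first candidates to review for URL Parameters settings or robots.txt rules.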
-
Joanne,
I'm afraid there's no way to know which pages are actually indexed from Webmaster Tools. You can run a simple search in Google, site:domain.com, and it will list "all" your indexed pages. However, there's no way to export that as a report.
You can create a report using a bit of a "hack". Log in to your Google Drive, create a new spreadsheet and use the following formula to populate rows:
=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=1"; "//cite")
This will load the first 100 results. You will need to repeat the process for every 100 results you have, changing the last variable: "start=1" to "start=100" and then "start=200", etc. (you see where I'm going). This could really be a pain in the butt given your site's size.
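To save typing each formula by hand, a small script can generate them. This is only a sketch: the function names are hypothetical, it assumes `start` is a zero-based offset (so it begins at 0 rather than 1), and it assumes Google stops serving results after roughly the first 1,000, so there is little point generating more pages than that.

```python
from urllib.parse import urlencode


def site_query_urls(domain, total_results=1000, page_size=100):
    """Build one Google search URL per page of a site: query."""
    urls = []
    for start in range(0, total_results, page_size):
        # safe=":" keeps "site:domain" readable instead of "site%3Adomain".
        query = urlencode(
            {"q": f"site:{domain}", "num": page_size, "start": start},
            safe=":",
        )
        urls.append(f"https://www.google.com/search?{query}")
    return urls


def import_xml_formulas(domain, total_results=1000, page_size=100):
    """Wrap each search URL in the spreadsheet formula from the post."""
    return [
        f'=importXml("{url}"; "//cite")'
        for url in site_query_urls(domain, total_results, page_size)
    ]


if __name__ == "__main__":
    for formula in import_xml_formulas("www.yourdomainnamehere.com"):
        print(formula)
```

Paste each printed formula into its own column or sheet so the result ranges don't overwrite each other.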
My recommendation is that you navigate your own site, decide which pages should be removed and then create the robots.txt regardless of what Google has indexed. Once you complete your robots.txt, it will take a few weeks (or even a month) to have the blocked pages removed.
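Before publishing the file, you can sanity-check draft rules against a handful of known URLs. The sketch below uses Python's standard-library robots.txt parser. The rules and URLs are invented examples, and this parser only does simple path-prefix matching, so it may not treat wildcard patterns the way Google does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules; replace with your own robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /filter/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given rules would stop the crawler fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    candidates = [
        "https://www.example.com/search?q=shoes",
        "https://www.example.com/filter/colour-red",
        "https://www.example.com/category/shoes",
    ]
    for url in blocked_urls(ROBOTS_TXT, candidates):
        print("blocked:", url)
```

If a category page you want indexed shows up as blocked, the rule is too broad and needs tightening before it goes live.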
Hope that helps!