Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Is there a way to get a list of Total Indexed pages from Google Webmaster Tools?
-
I'm doing a detailed analysis of how Google sees and indexes our website and we have found that there are 240,256 pages in the index which is way too many. It's an e-commerce site that needs some tidying up.
I'm working with an SEO specialist to set up URL parameters and put information in to the robots.txt file so the excess pages aren't indexed (we shouldn't have any more than around 3,00 - 4,000 pages) but we're struggling to find a way to get a list of these 240,256 pages as it would be helpful information in deciding what to put in the robots.txt file and which URL's we should ask Google to remove.
Is there a way to get a list of the URL's indexed? We can't find it in the Google Webmaster Tools.
-
Looks like I can only do the first thousand. It's a start though. Thank you for the information.
Many of the URL's on my list, when put in to Google search, are giving me 80-100 other variants I can remove by hand.
http://www.mathewporter.co.uk/list-a-domains-indexed-pages-in-google-docs/ for anyone else following.
-
Finally getting around to doing this and noticed that when I change the start number to anything above 900, it doesn't work - ie: it's only letting me look at the first 1,000 results for some reason.
The list of 1,000 has given me some good URL's to search off for the filtering thingy that was generating all the garbage URL's but I'd love to get past 1,000 if I can.
Does anyone know how?
-
Correct. I have gone in to URL Parameters already and set them to Crawl 'No URLs' for those we don't want crawled.
We haven't added those parameters listed in there in to the robots.txt file yet, but I will do that now. I had an initial consult today and we ran way over time when we discovered all this stuff so I have another appointment in a couple of weeks.
We have a sitemap of all the category pages and relevant static pages on the site already and Google has those indexed nicely. We just need to get rid of the 240,000 pages it has indexed that we don't want in there (frightening I know - it's a really high number).
I greatly appreciate you taking the time to respond. Thank you.
-
Thanks. There's a lot of auto-generated content, duplicate pages and we've set the robots.txt file up to exclude a large number of them. Now we wait.
Very helpful and greatly appreciated. Thank you.
-
Hi,
I'm going to assume that as you have said it's an e-commerce site that the URL parameters are created by product variations, filters, sorts etc. If so then you must already be seeing those parameters on the URL of your site as you navigate and in your analytics or search results.
Your SEO specialist should easily be able to add those parameters to the robots file. Then personally I would resubmit a site map for completeness and wait for results to take effect.
-
Joanne,
I'm afraid there's no way to know which pages are actually indexed from your Webmaster Tools. You can use a simple search in Google: site:domain.com and it will list "all" your indexed pages, however, there's no way to export that as a report.
You can create a report using some "hack". Login to your Google Drive, create a new spreadsheet and use the following command to populate rows:
=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=1"; "//cite")This will load the first 100 results. You will need to repeat the process for every 1000 results you have, changing the last variable: "start=1" to "start=100" and then "start=200", etc (you see where I'm going). This could really be a pain in the butt for your site's size.
My recommendation is you navigate your own site, decide which pages should be removed and then create the robots.txt regardless what google has indexed. Once you complete your robots.txt, it will take a few weeks (or even a month) to have the blocked pages removed.
Hope that helps!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Staging website got indexed by google
Our staging website got indexed by google and now MOZ is showing all inbound links from staging site, how should i remove those links and make it no index. Note- we already added Meta NOINDEX in head tag
Intermediate & Advanced SEO | | Asmi-Ta0 -
"Null" appearing as top keyword in "Content Keywords" under Google index in Google Search Console
Hi, "Null" is appearing as top keyword in Google search console > Google Index > Content Keywords for our site http://goo.gl/cKaQ4K . We do not use "null" as keyword on site. We are not able to find why Google is treating "null" as a keyword for our site. Is anyone facing such issue. Thanks & Regards
Intermediate & Advanced SEO | | vivekrathore0 -
My blog is indexing only the archive and category pages
Hi there MOZ community. I am new to the QandA and have a question. I have a blog Its been live for months - but I can not get the posts to rank in the serps. Oddly only the categories rank. The posts are crawled it seems - but seen as less important for a reason I don't understand. Can anyone here help with this? See here for what i mean. I have had several wp sites rank well in the serps - and the posts do much better. Than the categories or archives - super odd. Thanks to all for help!
Intermediate & Advanced SEO | | walletapp0 -
Best way to remove full demo (staging server) website from Google index
I've recently taken over an in-house role at a property auction company, they have a main site on the top-level domain (TLD) and 400+ agency sub domains! company.com agency1.company.com agency2.company.com... I recently found that the web development team have a demo domain per site, which is found on a subdomain of the original domain - mirroring the site. The problem is that they have all been found and indexed by Google: demo.company.com demo.agency1.company.com demo.agency2.company.com... Obviously this is a problem as it is duplicate content and so on, so my question is... what is the best way to remove the demo domain / sub domains from Google's index? We are taking action to add a noindex tag into the header (of all pages) on the individual domains but this isn't going to get it removed any time soon! Or is it? I was also going to add a robots.txt file into the root of each domain, just as a precaution! Within this file I had intended to disallow all. The final course of action (which I'm holding off in the hope someone comes up with a better solution) is to add each demo domain / sub domain into Google Webmaster and remove the URLs individually. Or would it be better to go down the canonical route?
Intermediate & Advanced SEO | | iam-sold0 -
How to add subdomains to webmaster tools?
Can anyone help with how I add a sub domain to webmaster tools? Also do I need to create a seperate sitemap for each sub domain? Any help appreciated!
Intermediate & Advanced SEO | | SamCUK1 -
Are pages with a canonical tag indexed?
Hello here, here are my questions for you related to the canonical tag: 1. If I put online a new webpage with a canonical tag pointing to a different page, will this new page be indexed by Google and will I be able to find it in the index? 2. If instead I apply the canonical tag to a page already in the index, will this page be removed from the index? Thank you in advance for any insights! Fabrizio
Intermediate & Advanced SEO | | fablau0 -
How to get content to index faster in Google.....pubsubhubbub?
I'm curious to know what tools others are using to get their content to index faster (other than html sitmap and pingomatic, twitter, etc) Would installing the wordpress pubsubhubbub plugin help even though it uses pingomatic? http://wordpress.org/extend/plugins/pubsubhubbub/
Intermediate & Advanced SEO | | webestate0 -
Does Google index url with hashtags?
We are setting up some Jquery tabs in a page that will produce the same url with hashtags. For example: index.php#aboutus, index.php#ourguarantee, etc. We don't want that content to be crawled as we'd like to prevent duplicate content. Does Google normally crawl such urls or does it just ignore them? Thanks in advance.
Intermediate & Advanced SEO | | seoppc20120