Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
How can I get a list of every url of a site in Google's index?
-
I work on a site that has almost 20,000 urls in its site map. Google WMT claims 28,000 indexed and a search on Google shows 33,000. I'd like to find what the difference is.
Is there a way to get an excel sheet with every url Google has indexed for a site?
Thanks... Mike
-
If this is still an issue you're facing, have you checked the sitemap settings to see which page types are getting included? For example, a site with a few thousand tags that are not entered in the sitemap but not yet set to noindex could easily produce extra pages like this.
The next step is parameterization. Anything going on there with search URLs or product URLs? eg ?refid=1235134&q=search+term or ?prod=152134&variant=blue
If you really want to scrape through Google, get a list of your sitemap and scrape queries like "inurl:domain.com/a", "inurl:domain.com/b", "inurl:domain.com/c". etc. This should allow you to dive deeper into the site map to see what Google really has indexed. For URL subfolders with tons of URLs like domain.com/product/a, you'll want to do the same thing at a subfolder level instead of root URLs.
-
You can do that with a tool like Scrapebox or Outwit. Go slow, or else you'll need to use proxies to get Google to respond fast enough. As another commenter mentioned, it's probably against TOS.
-
You could probably write a macro to do this, although just because you could doesn't mean you should. I don't think it is advisable because you do not want to violate any terms of use for anyone. That is never a good thing.
-
Yes, WMT API doesn't have it. The site site:xxxx.com search is where are got one of the two too high numbers. Thanks... Mike
-
Hi Marijn,
Thanks for the suggestions. 2.5 years of G/A organic landing pages is 10,000 urls.... 1/2 as many as the site map and 1/3rd as many as Google says indexed. On scraping google, do you know of a tool for that?
Thanks... Mike
-
Might be something you can get from the WMT API.
Also, to really see how many pages are indexed, do a site:xxxx.com search, go to the last page, include omitted results, go to the last page again, and add up how many you have. That's probably the most accurate number.
-
Hi Mike,
There a couple of solutions, neither of them provide you with 100% of data. The best would be to export a list of landing pages from Google Analytics or your favorite web analytics tool segmented by organic search/ Google. This would provide you with a list of pages that received traffic via search and so are indexed. If you cross reference them with your sitemaps that might already help you out a bit. Besides that you could crawl and scrape the URLS for a site:xxx.com search.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
SEO on Jobs sites: how to deal with expired listings with "Google for Jobs" around
Dear community, When dealing with expired job offers on jobs sites from a SEO perspective, most practitioners recommend to implement 301 redirects to category pages in order to keep the positive ranking signals of incoming links. Is it necessary to rethink this recommendation with "Google for Jobs" is around? Google's recommendations on how to handle expired job postings does not include 301 redirects. "To remove a job posting that is no longer available: Remove the job posting from your sitemap. Do one of the following: Note: Do NOT just add a message to the page indicating that the job has expired without also doing one of the following actions to remove the job posting from your sitemap. Remove the JobPosting markup from the page. Remove the page entirely (so that requesting it returns a 404 status code). Add a noindex meta tag to the page." Will implementing 301 redirects the chances to appear in "Google for Jobs"? What do you think?
Intermediate & Advanced SEO | | grnjbs07175 -
How to stop URLs that include query strings from being indexed by Google
Hello Mozzers Would you use rel=canonical, robots.txt, or Google Webmaster Tools to stop the search engines indexing URLs that include query strings/parameters. Or perhaps a combination? I guess it would be a good idea to stop the search engines crawling these URLs because the content they display will tend to be duplicate content and of low value to users. I would be tempted to use a combination of canonicalization and robots.txt for every page I do not want crawled or indexed, yet perhaps Google Webmaster Tools is the best way to go / just as effective??? And I suppose some use meta robots tags too. Does Google take a position on being blocked from web pages. Thanks in advance, Luke
Intermediate & Advanced SEO | | McTaggart0 -
Should I Add Location to ALL of My Client's URLs?
Hi Mozzers, My first Moz post! Yay! I'm excited to join the squad 🙂 My client is a full service entertainment company serving the Washington DC Metro area (DC, MD & VA) and offers a host of services for those wishing to throw events/parties. Think DJs for weddings, cool photo booths, ballroom lighting etc. I'm wondering what the right URL structure should be. I've noticed that some of our competitors do put DC area keywords in their URLs, but with the moves of SERPs to focus a lot more on quality over keyword density, I'm wondering if we should focus on location based keywords in traditional areas on page (e.g. title tags, headers, metas, content etc) instead of having keywords in the URLs alongside the traditional areas I just mentioned. So, on every product related page should we do something like: example.com/weddings/planners-washington-dc-md-va
Intermediate & Advanced SEO | | pdrama231
example.com/weddings/djs-washington-dc-md-va
example.com/weddings/ballroom-lighting-washington-dc-md-va OR example.com/weddings/planners
example.com/weddings/djs
example.com/weddings/ballroom-lighting In both cases, we'd put the necessary location based keywords in the proper places on-page. If we follow the location-in-URL tactic, we'd use DC area terms in all subsequent product page URLs as well. Essentially, every page outside of the home page would have a location in it. Thoughts? Thank you!!0 -
How can I make a list of all URLs indexed by Google?
I started working for this eCommerce site 2 months ago, and my SEO site audit revealed a massive spider trap. The site should have been 3500-ish pages, but Google has over 30K pages in its index. I'm trying to find a effective way of making a list of all URLs indexed by Google. Anyone? (I basically want to build a sitemap with all the indexed spider trap URLs, then set up 301 on those, then ping Google with the "defective" sitemap so they can see what the site really looks like and remove those URLs, shrinking the site back to around 3500 pages)
Intermediate & Advanced SEO | | Bryggselv.no0 -
Brackets vs Encoded URLs: The "Same" in Google's eyes, or dup content?
Hello, This is the first time I've asked a question here, but I would really appreciate the advice of the community - thank you, thank you! Scenario: Internal linking is pointing to two different versions of a URL, one with brackets [] and the other version with the brackets encoded as %5B%5D Version 1: http://www.site.com/test?hello**[]=all&howdy[]=all&ciao[]=all
Intermediate & Advanced SEO | | mirabile
Version 2: http://www.site.com/test?hello%5B%5D**=all&howdy**%5B%5D**=all&ciao**%5B%5D**=all Question: Will search engines view these as duplicate content? Technically there is a difference in characters, but it's only because one version encodes the brackets, and the other does not (See: http://www.w3schools.com/tags/ref_urlencode.asp) We are asking the developer to encode ALL URLs because this seems cleaner but they are telling us that Google will see zero difference. We aren't sure if this is true, since engines can get so _hung up on even one single difference in character. _ We don't want to unnecessarily fracture the internal link structure of the site, so again - any feedback is welcome, thank you. 🙂0 -
A few questions on Google's Structured Data Markup Helper...
I'm trying to go through my site and add microdata with the help of Google's Structured Data Markup Helper. I have a few questions that I have not been able to find an answer for. Here is the URL I am referring to: http://www.howlatthemoon.com/locations/location-chicago My company is a bar/club, with only 4 out of 13 locations serving food. Would you mark this up as a local business or a restaurant? It asks for "URL" above the ratings. Is this supposed to be the URL that ratings are on like Yelp or something? Or is it the URL for the page? Either way, neither of those URLs are on the page so I can't select them. If it is for Yelp should I link to it? How do I add reviews? Do they have to be on the page? If I make a group of days for Day of the Week for Opening hours, such as Mon-Thu, will that work out? I have events on this page. However, when I tried to do the markup for just the event it told me to use itemscope itemtype="http://schema.org/Event" on the body tag of the page. That is just a small part of the page, I'm not sure why I would put the event tag on the whole body? Any other tips would be much appreciated. Thanks!
Intermediate & Advanced SEO | | howlusa0 -
Does Google index url with hashtags?
We are setting up some Jquery tabs in a page that will produce the same url with hashtags. For example: index.php#aboutus, index.php#ourguarantee, etc. We don't want that content to be crawled as we'd like to prevent duplicate content. Does Google normally crawl such urls or does it just ignore them? Thanks in advance.
Intermediate & Advanced SEO | | seoppc20120 -
Include Cross Domain Canonical URL's in Sitemap - Yes or No?
I have several sites that have cross domain canonical tags setup on similar pages. I am unsure if these pages that are canonicalized to a different domain should be included in the sitemap. My first thought is no, because I should only include pages in the sitemap that I want indexed. On the other hand, if I include ALL pages on my site in the sitemap, once Google gets to a page that has a cross domain canonical tag, I'm assuming it will just note that and determine if the canonicalized page is the better version. I have yet to see any errors in GWT about this. I have seen errors where I included a 301 redirect in my sitemap file. I suspect its ok, but to me, it seems that Google would rather not find these URL's in a sitemap, have to crawl them time and time again to determine if they are the best page, even though I'm indicating that this page has a similar page that I'd rather have indexed.
Intermediate & Advanced SEO | | WEB-IRS0