What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for - exclude query parameters which just result in duplicate content, system files, etc. That may help to bring the amount down.
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write. In 100 lines of code you can probably code one. The question is of course: what do you want out of the crawl?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Webshop landing pages and product pages
Hi, I am doing extensive keyword research for the SEO of a big webshop. Since this shop sells technical books and software (legal books, tax software and so on), I come across a lot of very specific keywords for separate products. Isn't it better to try and rank in the SERP's with all the separate product pages, instead of with the landing (category) pages?
Intermediate & Advanced SEO | | Mat_C0 -
Home page vs inner page?
do you believe that the advantage of targeting a search term on the home page is now worse off than before? as I understand it ctr is a big factor now And as far as i can see if two pages are equal on page etc the better ctr will win out, the issue with the home page is the serp stars cannot be used hence the ctr on a product page will be higher? I feel if you where able to get a home page up quicker (1 year instead of two) you still lost out in the end due to the product page winning on ctr? do you think this is correct?
Intermediate & Advanced SEO | | BobAnderson0 -
Google crawling 200 page site thousands of times/day. Why?
Hello all, I'm looking at something a bit wonky for one of the websites I manage. It's similar enough to other websites I manage (built on a template) that I'm surprised to see this issue occurring. The xml sitemap submitted shows Google there are 229 pages on the site. Starting in the beginning of December Google really ramped up their intensity in crawling the site. At its high point Google crawled 13,359 pages in a single day. I mentioned I manage other similar sites - this is a very unusual spike. There are no resources like infinite scroll that auto generates content and would cause Google some grief. So follow up questions to my "why?" is "how is this affecting my SEO efforts?" and "what do I do about it?". I've never encountered this before, but I think limiting my crawl budget would be treating the symptom instead of finding the cure. Any advice is appreciated. Thanks! *edited for grammar.
Intermediate & Advanced SEO | | brettmandoes0 -
How to optimize count of interlinking by increasing Interlinking count of chosen landing pages and decreasing for less important pages within the site?
We have taken out our interlinking counts (Only Internal Links and not Outbound Links) through Google WebMaster tool and discovered that the count of interlinking of our most significant pages are less as compared to of less significant pages. Our objective is to reverse the existing behavior by increasing Interlinking count of important pages and reduce the count for less important pages so that maximum link juice could be transferred to right pages thereby increasing SEO traffic.
Intermediate & Advanced SEO | | vivekrathore0 -
Making Filtered Search Results Pages Crawlable on an eCommerce Site
Hi Moz Community! Most of the category & sub-category pages on one of our client's ecommerce site are actually filtered internal search results pages. They can configure their CMS for these filtered cat/sub-cat pages to have unique meta titles & meta descriptions, but currently they can't apply custom H1s, URLs or breadcrumbs to filtered pages. We're debating whether 2 out of 5 areas for keyword optimization is enough for Google to crawl these pages and rank them for the keywords they are being optimized for, or if we really need three or more areas covered on these pages as well to make them truly crawlable (i.e. custom H1s, URLs and/or breadcrumbs)…what do you think? Thank you for your time & support, community!
Intermediate & Advanced SEO | | accpar0 -
Site migration from non canonicalized site
Hi Mozzers - I'm working on a site migration from a non-canonicalized site - I am wondering about the best way to deal with that - should I ask them to canonicalize prior to migration? Many thanks.
Intermediate & Advanced SEO | | McTaggart0 -
Hey guys i have this issues on my crawling report what should i do to exlude the pages? are d
Overly-Dynamic URL Overly-Dynamic URL Although search engines can crawl dynamic URLs, search engine representatives have warned against using over 2 parameters in a given URL. Search engines may also see dynamic versions of the same URL as unique URLs, creating duplicate content.
Intermediate & Advanced SEO | | adulter0 -
Pages un-indexed in my site
My current website www.energyacuity.com has had most pages indexed for more than a year. However, I tried cache a few of the pages, and it looks the only one that is now indexed by Goggle is the homepage. Any thoughts on why this is happening?
Intermediate & Advanced SEO | | abernatj0