What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for - exclude query parameters which just result in duplicate content, system files, etc. That may help to bring the amount down.
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write. In 100 lines of code you can probably code one. The question is of course: what do you want out of the crawl?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Should I set a max crawl rate in Webmaster Tools?
We have a website with around 5,000 pages and for the past few months we've had our crawl rate set to maximum (we'd just started paying for a top of the range dedicated server at the time, so performance wasn't an issue). Google Webmaster Tools has alerted me this morning that the crawl rate has expired so I'd have to manually set the rate again. In terms of SEO, is having a max rate a good thing? I found this post on Moz, but it's dated from 2008. Any thoughts on this?
Intermediate & Advanced SEO | | LiamMcArthur0 -
Best Way to Incorporate FAQs into Every Page - Duplicate Content?
Hi Mozzers, We want to incorporate a 'Dictionary' of terms onto quite a few pages on our site, similar to an FAQ system. The 'Dictionary' has 285 terms in it, with about 1 sentence of content for each one (approximately 5,000 words total). The content is unique to our site and not keyword stuffed, but I am unsure what Google will think about us having all this shared content on these pages. I have a few ideas about how we can build this, but my higher-ups really want the entire dictionary on every page. Thoughts? Image of what we're thinking here - http://screencast.com/t/GkhOktwC4I Thanks!
Intermediate & Advanced SEO | | Travis-W0 -
Best way to remove low quality paginated search pages
I have a website that has around 90k pages indexed, but after doing the math I realized that I only have around 20-30k pages that are actually high quality, the rest are paginated pages from search results within my website. Every time someone searches a term on my site, that term would get its own page, which would include all of the relevant posts that are associated with that search term/tag. My site had around 20k different search terms, all being indexed. I have paused new search terms from being indexed, but what I want to know is if the best route would be to 404 all of the useless paginated pages from the search term pages. And if so, how many should I remove at one time? There must be 40-50k paginated pages and I am curious to know what would be the best bet from an SEO standpoint. All feedback is greatly appreciated. Thanks.
Intermediate & Advanced SEO | | WebServiceConsulting.com0 -
Best-of-the-web content in steep competition, ecommerce site
Hello, I'm helping my client write a long, comprehensive, best-of-the-web piece of content. It's a boring ecommerce niche, but on the informational side the top 10 competitors for the most linked to topic are all big players with huge domain authority. There's not a lot of links in the industry, should I try to top all the big industries through better content (somehow), pictures, illustrations, slideshows with audio, and by being more thorough than these very good competitors? Or should I go for something that's less linked to (maybe 1/5 as much people linking to it) but easier? or both? We're on a short timeline of 3 and 1/2 months until we need traffic and our budget is not huge
Intermediate & Advanced SEO | | BobGW1 -
How would you handle 12,000 "tag" pages on Wordpress site?
We have a Wordpress site where /tag/ pages were not set to "noindex" and they are driving 25% of site's traffic (roughly 100,000 visits year to date). We can't simply "noindex" them all now, or we'll lose a massive amount of traffic. We can't possibly write unique descriptions for all of them. We can't just do nothing or a Panda update will come by and ding us for duplicate content one day (surprised it hasn't already). What would you do?
Intermediate & Advanced SEO | | M_D_Golden_Peak1 -
Best way to re-order page elements based on search engine users
Both versions of the page has essentially same content, but in different order. One is for users coming from Google (and google bot) and other is for everybody else. Questions: Is it cloaking? what will be the best way to re-order elements on the page: totally different style sheets for each version, or calling in different divs in a same style sheet? Is there any better way to re-order elements based on search engine? Let me make it clear again: the content is same for everyone, just in different order for visitors coming from Google and everybody else. Don't ask me the reason behind it (executive orders!!)
Intermediate & Advanced SEO | | StickyRiceSEO0 -
Question about "launching to G" a new site with 500000 pages
Hey experts, how you doing? Hope everything is ok! I'm about to launch a new website, the code is almost done. Totally fresh new domain. The site will have like 500000 pages, fully internal optimized of course. I got my taticts to make G "travel" over my site to get things indexed. The problem is: to release it in "giant mode" or release it "thin" and increase the pages over the time? What do you recomend? Release the big G at once and let them find the 500k pages (do they think this can be a SPAM or something like that)? Or release like 1k/2k per day? Anybody know any good aproach to improve my chances of success here? Any word will be apreciated. Thanks!
Intermediate & Advanced SEO | | azaiats20