What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs whose query parameters just produce duplicate content, system files, and so on. That may help bring the number of pages down.
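As a rough illustration of that kind of exclusion, here is a minimal Python sketch that normalizes URLs before they are queued. The parameter names in `IGNORED_PARAMS` and the extensions in `SKIPPED_EXTENSIONS` are made-up examples, not anything specific to your site, so you would need to replace them with whatever actually causes duplicates there.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical examples only; replace with the parameters and file types
# that produce duplicates or noise on the site being crawled.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIPPED_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")


def normalize(url):
    """Return a canonical form of url, or None if the URL should be skipped."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return None
    # Drop ignored parameters and sort the rest so that parameter order
    # does not create distinct URLs for the same page.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment is dropped because it never reaches the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )
```

Running every discovered link through something like `normalize` and only queueing results that are not `None` and not already seen keeps duplicate variants of the same page out of the crawl.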
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The real question is: what do you want to get out of the crawl?
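To give an idea of what such a crawler might look like, here is a minimal sketch in Python using only the standard library. It collects the URL, title, and meta description mentioned above. The names `PageParser`, `fetch_page`, `crawl`, and `write_csv`, and the `example-crawler/0.1` user agent, are all made up for this example. It ignores robots.txt, rate limiting, and retries, so treat it as a starting point rather than something ready for a site with millions of pages.

```python
import csv
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import Request, urlopen


class PageParser(HTMLParser):
    """Collects the title, meta description, and link targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_page(url):
    """Download one page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=1000):
    """Breadth-first crawl of a single host, yielding (url, title, description)."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        crawled += 1
        parser = PageParser()
        parser.feed(html)
        yield url, parser.title.strip(), parser.description.strip()
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # resolve and drop #fragment
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)


def write_csv(rows, out):
    """Write rows to out one at a time, so results are never held in memory."""
    writer = csv.writer(out)
    writer.writerow(["url", "title", "description"])
    for row in rows:
        writer.writerow(row)
```

Calling `write_csv(crawl("https://example.com/"), sys.stdout)` (after `import sys`, and with your own start URL) would stream the results as CSV. Note that `seen` and `queue` still live in memory here; for millions of URLs they would probably have to move to disk or a database, which is likely where desktop tools run into trouble as well.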