What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs whose query parameters only produce duplicate content, system files, etc. That may bring the number of pages down considerably.
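As a rough illustration of that kind of filtering, here is a minimal Python sketch using only the standard library. The parameter names in `IGNORED_PARAMS` and the extensions in `SKIPPED_EXTENSIONS` are made-up examples, not rules from any particular site or crawler; you would replace them with whatever causes duplicates on the site in question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rules for illustration only; adjust to the site being crawled.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIPPED_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")


def normalize_url(url):
    """Return a canonical form of url, or None if it should not be crawled."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return None
    # Drop parameters that only create duplicate content and sort the rest,
    # so that ?a=1&b=2 and ?b=2&a=1 map to the same URL.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment is dropped because it never changes what the server returns.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )
```

Running every discovered link through a function like this before queueing it means duplicates collapse to one URL and excluded files never get fetched at all.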
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
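For reference, pulling those fields out of a page does not need much. The sketch below uses Python's built-in `html.parser`; the choice of `<h1>` as one of the "few HTML elements" is my assumption, and `PageSummary` and `summarize` are just illustrative names.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, meta description and h1 text of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.h1 = []
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.h1.append("")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1[-1] += data


def summarize(url, html):
    """Return the basic fields for one already-downloaded page."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.description,
        "h1": [text.strip() for text in parser.h1],
    }
```

Since only a small dictionary is kept per page and the HTML itself can be discarded right away, memory use per page stays small.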
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The real question is: what do you want to get out of the crawl?
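To give an idea of what such a crawler could look like, here is a minimal sketch in Python using only the standard library. It is not a finished tool: it ignores robots.txt, politeness delays and retries, and the names `LinkParser`, `fetch` and `crawl` as well as the `max_pages` limit are my own choices for illustration. Each row is written to the output as soon as a page is parsed, so pages are not held in memory.

```python
import csv
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Download a page as text; swap in any HTTP client you prefer."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, out, fetch=fetch, max_pages=1000):
    """Breadth-first crawl of one host, writing one CSV row per page to out."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        writer.writerow([url, parser.title.strip()])
        pages += 1
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

You would call it with an open file, for example `crawl("http://example.com/", open("pages.csv", "w", newline=""))`. For a site with millions of pages, the `seen` set and the queue are the parts that grow, so they would probably need to move to disk (a SQLite table, for instance) rather than live in memory.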