What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs whose query parameters only produce duplicate content, system files, and so on. That may bring the number of pages down considerably.
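To make that concrete, here is a minimal sketch of such a filter in Python, using only the standard library. The parameter names, file extensions, and path prefixes are assumptions for illustration, not rules from any particular site; replace them with whatever produces duplicates on the site you are crawling.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rules for illustration; adjust them to the site being crawled.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIPPED_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")
SKIPPED_PREFIXES = ("/admin/", "/cgi-bin/")


def normalize(url):
    """Return a canonical form of url, or None if it should not be crawled."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Skip static assets and system paths outright.
    if path.lower().endswith(SKIPPED_EXTENSIONS) or path.startswith(SKIPPED_PREFIXES):
        return None
    # Drop parameters that only create duplicates; sort the rest so that
    # ?a=1&b=2 and ?b=2&a=1 map to the same URL.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment is dropped because it never changes the page that is served.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(kept), ""))
```

Run every discovered link through `normalize` before queueing it, and skip the ones that come back as `None` or that you have already seen in normalized form.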
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
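For reference, pulling out those basic fields does not need much. The following is a rough sketch using Python's built-in `html.parser`; the class name `PageFields`, the helper `extract_fields`, and the choice of the first `<h1>` as the "HTML element" are my own assumptions, not part of any existing tool.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collects the <title>, the meta description, and the first non-empty <h1>."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "h1": ""}
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.fields["description"] = attrs.get("content") or ""
        elif tag in ("title", "h1") and not self.fields[tag]:
            self._current = tag

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self.fields[tag] = self.fields[tag].strip()
            self._current = None


def extract_fields(html):
    """Parse one page and return its fields as a dict."""
    parser = PageFields()
    parser.feed(html)
    return parser.fields
```

Because the parser works on a stream and keeps only three short strings per page, memory use stays flat no matter how many pages are processed, as long as each result is written out before the next page is fetched.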
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The question, of course, is what you want out of the crawl.
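As a rough illustration of how small such a crawler can be, here is a sketch in Python using only the standard library. It is not a production tool: the names `crawl`, `fetch_over_http`, and `LinkAndTitleParser` are made up for this example, it ignores robots.txt and rate limiting (which a real crawl of a large site should respect), and it keeps the set of seen URLs in memory. Each row is written to the output as soon as the page is parsed, so page data does not pile up in memory.

```python
import csv
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links, self.title, self._in_title = [], "", False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def fetch_over_http(url):
    """Default fetcher; assumes pages decode acceptably as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, out, fetch=fetch_over_http, max_pages=1000):
    """Breadth-first crawl of one host, writing (url, title) rows to out."""
    writer = csv.writer(out)
    host = urlsplit(start_url).netloc
    queue, seen, done = deque([start_url]), {start_url}, 0
    while queue and done < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkAndTitleParser()
        parser.feed(html)
        writer.writerow([url, parser.title.strip()])  # stream results out
        done += 1
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return done
```

Typical use would be something like `crawl("http://www.example.com/", open("pages.csv", "w", newline=""))`. For millions of pages, the `seen` set and the queue are the parts most likely to need replacing with something disk-backed, such as a database table.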