What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: query parameters that just produce duplicate content, system files, etc. That may help bring the number of pages down.
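As a rough sketch of that kind of filtering, something like the following could sit in front of whatever crawler you end up using. The extension and path lists are made-up examples, not rules from any particular tool, and the sketch assumes every query parameter only produces duplicates; if some parameters matter (pagination, for instance), keep a whitelist instead of dropping them all.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical skip rules; adjust them to the site being crawled.
SKIP_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".pdf", ".zip")
SKIP_PREFIXES = ("/wp-admin/", "/cgi-bin/")


def normalize(url):
    """Drop the query string and fragment so parameter variants collapse to one URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def should_crawl(url):
    """Return False for system files and paths that hold no useful page content."""
    path = urlsplit(url).path.lower()
    if path.endswith(SKIP_EXTENSIONS):
        return False
    return not path.startswith(SKIP_PREFIXES)
```

Run every discovered link through `normalize` before checking whether it has been seen, and only queue it if `should_crawl` returns True.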
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The real question is: what do you want out of the crawl?
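To give an idea of what such a crawler might look like, here is a minimal sketch using only the Python standard library. It stays on one host and collects the URL, title, and meta description of each page. The names `PageParser`, `fetch`, and `crawl` are made up for this example, and it leaves out things a real crawl of millions of pages would need (robots.txt handling, rate limiting, retries, parallel fetching).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the title, meta description and links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Download one page as text; swap this out for another HTTP client if needed."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, max_pages=1000):
    """Breadth-first crawl of one host, yielding (url, title, description) per page."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        fetched += 1
        parser = PageParser()
        parser.feed(html)
        yield url, parser.title.strip(), parser.description
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
```

Because `crawl` is a generator, each row can be written to a CSV file or database as it arrives instead of being held in memory. The `seen` set and the queue are the parts that keep growing, so on a very large site those are the first candidates to move to disk.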