What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store. It is much like Vertica, although far more specialized for our use case, particularly in terms of the optimizations around compression and speed.
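To make the column-oriented idea concrete, here is a minimal sketch of why storing data by column helps compression. It is not Linkscape's actual format; the record layout, the example domains, and the helper names `to_columns`, `rle_encode`, and `rle_decode` are all assumptions made up for illustration, and run-length encoding is just one simple compression scheme among many.

```python
from itertools import groupby

# Hypothetical row-oriented link records: (source_domain, target_domain, http_status).
rows = [
    ("a.example", "b.example", 200),
    ("a.example", "c.example", 200),
    ("a.example", "d.example", 404),
    ("b.example", "c.example", 200),
]


def to_columns(rows):
    """Transpose row tuples into one list per column."""
    return [list(col) for col in zip(*rows)]


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


source_col, target_col, status_col = to_columns(rows)

# Sorted or repetitive columns collapse into a few runs.
encoded_source = rle_encode(source_col)
print(encoded_source)
```

Because values in one column tend to repeat (the same source domain appears on many consecutive rows), a per-column encoding can shrink them far more than compressing mixed rows would.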
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), and compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs). We then calculate our metrics from those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all compressed raw text data, so it is a LOT of information). Making this fast and scalable is certainly a challenge.
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
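As a toy illustration of the crawl, extract, and link graph steps described for Linkscape, the sketch below pulls links out of a few in-memory pages and counts inbound links and linking domains. It is only a small-scale analogy in Python, not the production C++ pipeline; the page data, the `LinkExtractor` class, and the functions `extract_links`, `build_link_graph`, and `linking_domains` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for every link found in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def build_link_graph(pages):
    """Map each target URL to the set of source URLs linking to it."""
    inbound = defaultdict(set)
    for url, html in pages.items():
        for target in extract_links(url, html):
            if target != url:  # ignore self-links
                inbound[target].add(url)
    return inbound


def linking_domains(inbound, url):
    """Return the distinct domains that link to the given URL."""
    return {urlparse(src).netloc for src in inbound.get(url, set())}


# Hypothetical crawled pages (URL -> HTML), standing in for real crawl output.
pages = {
    "http://a.example/": '<a href="http://c.example/">c</a> <a href="/about">about</a>',
    "http://a.example/about": '<a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>',
}

graph = build_link_graph(pages)
print(len(graph["http://c.example/"]))                      # inbound links
print(sorted(linking_domains(graph, "http://c.example/")))  # linking domains
```

At the scale quoted above (tens of billions of URLs), the same idea would need distributed processing and compact storage rather than in-memory dictionaries.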
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (for example, calculating the aggregates and deltas you see in the overview sections).
We use S3 to archive all old crawls and Cassandra for some of the details you see in detailed views; a lot of the overviews and aggregates are served from the web app db.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
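For a rough picture of what "computing issues" and "deltas" for a per-site crawl could involve, the sketch below groups exact-duplicate pages by a content hash and compares two crawls of the same site. It is an assumption-laden illustration: the real issue processing is C++ and its duplicate detection method is not described here, and the names `content_fingerprint`, `find_duplicates`, and `crawl_delta` plus the sample crawls are invented for this example.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(html):
    """Hash whitespace-normalized, lowercased page content.

    This only catches exact duplicates after normalization; near-duplicate
    detection would need a different technique.
    """
    normalized = " ".join(html.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(crawl):
    """Return groups of URLs whose content fingerprints match."""
    groups = defaultdict(list)
    for url, html in crawl.items():
        groups[content_fingerprint(html)].append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


def crawl_delta(previous, current):
    """Report which URLs appeared or disappeared between two crawls."""
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
    }


# Hypothetical crawls of one site (path -> HTML).
last_week = {"/": "<h1>Home</h1>", "/old": "<p>Old</p>"}
this_week = {
    "/": "<h1>Home</h1>",
    "/index.html": "<h1>Home</h1>  ",
    "/new": "<p>New</p>",
}

duplicates = find_duplicates(this_week)
delta = crawl_delta(last_week, this_week)
print(duplicates)
print(delta)
```

Precomputing results like these once per crawl is what lets an overview page be served quickly instead of recalculating on every request.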
Hope that helps explain! Definitely let me know if you have more questions though!
Kate -
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.