What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here, I have realised it is a big deal to crawl the web and index all the links.
For that, I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store, much like Vertica, although far more specialized for our use case, particularly in terms of the optimizations around compression and speed.
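To illustrate why a column-oriented layout compresses well, here is a minimal Python sketch. It is not the Linkscape store (which is C++ and not shown in this thread); the function names, the sample rows, and the choice of run-length encoding are all assumptions made for illustration.

```python
from itertools import groupby


def rows_to_columns(rows):
    """Pivot row dicts into one list per column (assumes every row has the same keys)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def rle_encode(values):
    """Run-length encode a column: consecutive repeats become (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical crawl rows, invented for this example.
rows = [
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 404},
    {"domain": "example.org", "status": 200},
]

columns = rows_to_columns(rows)
encoded = {name: rle_encode(values) for name, values in columns.items()}
print(encoded["domain"])
```

Storing each column contiguously puts similar values next to each other, so the repeated domain collapses into a single run, which is the general idea behind compression-friendly column stores.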
Each month we crawl 1-2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically 40-90 billion URLs), and then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text compressed data, so it is a LOT of information). Making this fast and scalable is certainly a challenge.
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
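As a rough illustration of the "extract links, build a link graph, compute metrics" steps described above, here is a small Python sketch. It is not the Linkscape pipeline: the function names and sample URLs are hypothetical, and counting inbound links and distinct linking hosts is only a stand-in for the real metrics, which are not described in this thread.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_link_graph(links):
    """Map each target URL to the set of source URLs that link to it."""
    inbound = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore self-links
            inbound[target].add(source)
    return inbound


def linking_hosts(inbound, url):
    """Return the distinct hostnames that link to the given URL."""
    return {urlsplit(source).hostname for source in inbound.get(url, ())}


# Hypothetical (source, target) pairs as they might come out of a crawl.
links = [
    ("http://a.example/1", "http://c.example/"),
    ("http://a.example/2", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://a.example/1"),
]

inbound = build_link_graph(links)
print(len(inbound["http://c.example/"]))
print(sorted(linking_hosts(inbound, "http://c.example/")))
```

At tens of billions of URLs this in-memory approach would not work as written; the point is only to show the shape of the data that the precomputed views are built from.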
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views; a lot of the overviews and aggregates are served with the web app db.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
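For readers wondering what computing a duplicate-page issue and a crawl-over-crawl delta could look like, here is a minimal Python sketch. It is an assumption-laden illustration, not the custom crawl code (which is C++): exact matching on a hash of whitespace-normalized text is a swapped-in technique, and all names and sample pages are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(body):
    """Hash the page text after collapsing whitespace and lowercasing."""
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose normalized content is identical; return groups of 2 or more."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[content_fingerprint(body)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


def issue_delta(previous_count, current_count):
    """Change in an issue count between two crawls (positive means more issues)."""
    return current_count - previous_count


# Hypothetical crawl result for one site: URL path -> page text.
pages = {
    "/a": "Hello   World",
    "/b": "hello world",
    "/c": "Something else",
}

duplicates = find_duplicates(pages)
print(duplicates)
print(issue_delta(previous_count=0, current_count=len(duplicates)))
```

A production crawler would more likely use near-duplicate detection rather than exact hashes, but the precompute-then-serve idea is the same: the groups and deltas are calculated once per crawl and then read directly by the web app.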
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.