What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case, particularly in terms of the optimizations around compression and speed.
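As a rough illustration of why a column-oriented layout compresses well (this is not Linkscape's actual store, which has not been published in this thread), the sketch below transposes a few made-up link rows into columns and run-length encodes one of them. The row layout, the domain names, and the helper names `to_columns`, `rle_encode`, and `rle_decode` are all assumptions made for the example.

```python
from itertools import groupby

# Hypothetical link rows: (source_domain, target_domain, is_nofollow).
rows = [
    ("a.com", "b.com", False),
    ("a.com", "c.com", False),
    ("a.com", "d.com", True),
    ("b.com", "c.com", False),
]


def to_columns(rows):
    """Transpose row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    column = []
    for value, count in pairs:
        column.extend([value] * count)
    return column


columns = to_columns(rows)
encoded_source = rle_encode(columns[0])
print(encoded_source)  # [('a.com', 3), ('b.com', 1)]
```

Because values in a single sorted column tend to repeat, storing runs instead of individual values is one simple way a column store can save space; a production system would likely combine several encodings.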
Each month we crawl between 1-2 petabytes of data, strip out the parts we care about (links, page attributes, etc.) and then compute a link graph of how all those sites link to one another (typically between 40-90 billion URLs) and then calculate our metrics using those results. Once we have all of that we then precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text compressed data - so it is a LOT of information). Making this fast and scalable is certainly a challenge.
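The pipeline described above (extract links, build a link graph, compute metrics, precompute views) can be sketched at toy scale as follows. This is only an illustration under stated assumptions: the sample pages, the `LinkExtractor` class, and the functions `build_link_graph`, `inbound_counts`, and `top_linked_view` are hypothetical, and a plain inbound-link count stands in for the real Linkscape metrics, which are not described in this thread.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def build_link_graph(pages):
    """Map each crawled URL to the set of URLs it links to."""
    graph = defaultdict(set)
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url].update(link for link in parser.links if link != url)
    return graph


def inbound_counts(graph):
    """Stand-in metric: number of distinct pages linking to each URL."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


def top_linked_view(counts, limit=10):
    """Precomputed 'view': URLs ordered by inbound links, then by name."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


# Hypothetical crawl output: URL -> raw HTML.
pages = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="/about">about</a>',
    "http://b.example/": '<a href="http://a.example/about">about a</a>',
}

graph = build_link_graph(pages)
counts = inbound_counts(graph)
view = top_linked_view(counts)
print(view)  # [('http://a.example/about', 2), ('http://b.example/', 1)]
```

Computing the view once and storing it, rather than walking the graph on every request, is the same trade-off the post describes: more work at processing time in exchange for fast reads.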
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth).
All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
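To make the idea of duplicate detection, aggregates, and deltas concrete, here is a minimal sketch. It assumes exact-match duplicates found by hashing page bodies, which is a simplification chosen for the example; the actual issue-processing code (C++) may well use a different or fuzzier technique. The crawl data and the names `find_duplicates`, `summarize`, and `delta` are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicates(crawl):
    """Group URLs whose page bodies are byte-for-byte identical.

    crawl: dict mapping URL path -> page body (str).
    """
    by_digest = defaultdict(list)
    for url, body in crawl.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


def summarize(crawl):
    """Precompute the aggregates an overview page might show."""
    duplicate_groups = find_duplicates(crawl)
    return {
        "pages": len(crawl),
        "duplicate_pages": sum(len(group) for group in duplicate_groups),
    }


def delta(previous, current):
    """Change in each aggregate between two crawl summaries."""
    return {key: current[key] - previous[key] for key in current}


# Hypothetical per-site crawls from two consecutive weeks.
last_week = {"/": "home", "/a": "same text", "/b": "same text"}
this_week = {"/": "home", "/a": "same text", "/b": "new text", "/c": "more"}

previous_summary = summarize(last_week)
current_summary = summarize(this_week)
print(delta(previous_summary, current_summary))
# {'pages': 1, 'duplicate_pages': -2}
```

Storing one small summary per crawl means the web app only has to subtract two dictionaries to show a delta, instead of reprocessing either crawl.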
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views, while a lot of the overviews and aggregates are served from the web app db.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.