What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case - particularly in terms of the optimizations around compression and speed.
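To make the column-oriented idea concrete, here is a minimal sketch in Python. It is not Linkscape's store (which the post says is custom and written in C++). The field names, the sample rows, and the use of run-length encoding are all assumptions chosen for illustration. The point is only that storing each field as its own column puts similar values next to each other, which tends to compress well.

```python
from itertools import groupby

def to_columns(rows):
    """Pivot a list of row dicts into one list per field (column layout)."""
    if not rows:
        return {}
    return {field: [row[field] for row in rows] for field in rows[0]}

def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Hypothetical crawl records; the field names are illustrative only.
rows = [
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 404},
    {"domain": "example.org", "status": 200},
]

columns = to_columns(rows)
encoded_domain = rle_encode(columns["domain"])
print(encoded_domain)
```

Sorting rows by domain before encoding makes the runs longer, which is one reason column stores often keep data sorted on a key.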
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.) and then compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs) and then calculate our metrics using those results. Once we have all of that we then precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text compressed data - so it is a LOT of information). Making this fast and scalable is certainly a challenge.
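As a rough illustration of the "link graph, then metrics" step, the sketch below builds a tiny graph from (source, target) link pairs and runs a basic PageRank-style iteration over it. The post does not say how the Linkscape metrics are actually calculated, so PageRank is swapped in here purely as a well-known example of a link-based metric. The function names, the sample URLs, and the damping value are assumptions.

```python
from collections import defaultdict

def build_link_graph(links):
    """Map each source URL to the set of URLs it links to."""
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)
        graph.setdefault(target, set())  # every URL becomes a node
    return dict(graph)

def pagerank(graph, damping=0.85, iterations=50):
    """Basic power-iteration PageRank; scores sum to 1."""
    n = len(graph)
    rank = {url: 1.0 / n for url in graph}
    for _ in range(iterations):
        new_rank = {url: (1.0 - damping) / n for url in graph}
        for url, targets in graph.items():
            if targets:
                share = damping * rank[url] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages with no outlinks spread their score evenly.
                share = damping * rank[url] / n
                for other in graph:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical links extracted from crawled pages.
links = [
    ("a.com/", "b.com/"),
    ("c.com/", "b.com/"),
    ("b.com/", "a.com/"),
]

graph = build_link_graph(links)
ranks = pagerank(graph)
top_url = max(ranks, key=ranks.get)
print(top_url)
```

At 40 to 90 billion URLs the graph cannot sit in one Python dict, so this kind of computation has to be distributed across machines. The sketch only shows the shape of the calculation.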
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40 and 60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
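One simple way to flag pages that are duplicates of one another is to hash each page's normalized text and group URLs that share a hash. The sketch below does exactly that. It is an assumption for illustration, not the custom crawl's actual method (which the post does not describe), and it only catches exact matches after normalization. Near-duplicate detection would need something like shingling.

```python
import hashlib
from collections import defaultdict

def content_fingerprint(text):
    """Hash page text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(pages):
    """Return groups of URLs whose normalized content is identical."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

# Hypothetical crawl of one site: URL -> extracted page text.
pages = {
    "/a": "Hello   World",
    "/b": "hello world",
    "/c": "Something else entirely",
}

duplicates = find_duplicates(pages)
print(duplicates)
```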
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views, and a lot of the overviews and aggregates are served with the web app db.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they generate over $30 billion in revenue annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.