What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store. It is much like Vertica, although far more specialized for our use case, particularly in terms of the optimizations around compression and speed.
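To illustrate why a column-oriented layout helps with compression, here is a minimal Python sketch. It is not the Linkscape store (which is a custom C++ system whose internals are not described here); the row data, function names, and the choice of dictionary plus run-length encoding are all assumptions made for illustration.

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct value to a small integer id.

    Returns (dictionary, ids), where dictionary[i] is the value for id i.
    """
    dictionary = {}
    ids = []
    for value in values:
        # New values get the next free id; repeated values reuse their id.
        ids.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), ids

def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(ids)]

# Hypothetical link rows: (source domain, target domain, anchor text).
rows = [
    ("a.com", "b.com", "home"),
    ("a.com", "c.com", "home"),
    ("a.com", "c.com", "about"),
    ("b.com", "c.com", "about"),
]

# Column orientation: store one list per column instead of one tuple per row.
sources, targets, anchors = (list(column) for column in zip(*rows))

# Values within a single column repeat a lot, so they compress well.
source_dict, source_ids = dictionary_encode(sources)
source_runs = run_length_encode(source_ids)
```

Because each column holds values of one kind, sorted or clustered columns collapse into a handful of runs, and a query that only needs one column never has to read the others.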
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs), and then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text compressed data, so it is a LOT of information). Making this fast and scalable is certainly a challenge.
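The crawl, extract, graph, metric pipeline described above can be sketched at toy scale in Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, the pages are made up, and a plain inbound-link count stands in for the real metrics, which are not specified in this thread.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

def build_link_graph(pages):
    """Map each crawled URL to the set of URLs it links to (self-links dropped)."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = set(parser.links) - {url}
    return graph

def inbound_counts(graph):
    """A stand-in metric: how many crawled pages link to each URL."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)

# Hypothetical crawl results: URL -> raw HTML.
pages = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="/about">About</a>',
    "http://b.example/": '<a href="http://a.example/about">A about</a>',
}
graph = build_link_graph(pages)
counts = inbound_counts(graph)
```

At tens of billions of URLs the graph will not fit in one machine's memory, which is why the real processing is spread over many instances and the results are precomputed into views rather than calculated per request.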
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth).
All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site and also compute issues (like which pages are duplicates of one another). Each of those crawls is then processed and precomputed to be served quickly and easily within the web app (for example, calculating the aggregates and deltas you see in the overview sections).
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views; a lot of the overviews and aggregates are served from the web app DB.
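As a rough idea of what computing a duplicate-page issue and a crawl-to-crawl delta could look like, here is a small Python sketch. It only catches exact duplicates after whitespace and case normalization; the actual issue processing is C++ and its method is not described here, so every name and the hashing approach below are assumptions.

```python
import hashlib
from collections import defaultdict

def content_fingerprint(text):
    """Hash the page text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(crawl):
    """Group URLs whose normalized content is identical.

    crawl maps URL -> page text. Returns a list of URL groups, each with
    two or more members.
    """
    groups = defaultdict(list)
    for url, text in crawl.items():
        groups[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

def duplicate_page_count(crawl):
    """Aggregate: number of pages that share content with another page."""
    return sum(len(group) for group in find_duplicates(crawl))

def duplicate_delta(previous_crawl, current_crawl):
    """Delta: change in the duplicate-page count between two crawls."""
    return duplicate_page_count(current_crawl) - duplicate_page_count(previous_crawl)

# Hypothetical per-site crawls from two consecutive weeks.
last_week = {"/": "Welcome", "/home": "welcome  ", "/about": "About us"}
this_week = {"/": "Welcome", "/home": "Home page", "/about": "About us"}

duplicates_last_week = find_duplicates(last_week)
delta = duplicate_delta(last_week, this_week)
```

Storing the aggregate per crawl means the overview page only has to subtract two precomputed numbers instead of rescanning both crawls.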
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from an SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.