What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case, particularly in terms of the optimizations around compression and speed.
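To make the column-oriented idea a bit more concrete, here is a minimal Python sketch. It is not Linkscape's actual code (which is C++ and far more specialized); it only shows why storing each field as its own column helps compression: repeated values sit next to each other and collapse under run-length encoding. The field names and sample rows are made up for illustration.

```python
from itertools import groupby

# Hypothetical rows: (source_url, target_domain, http_status)
rows = [
    ("a.com/1", "b.com", 200),
    ("a.com/2", "b.com", 200),
    ("a.com/3", "b.com", 200),
    ("a.com/4", "c.com", 404),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


columns = to_columns(rows, ["source_url", "target_domain", "http_status"])
encoded_status = rle_encode(columns["http_status"])
```

Run-length encoding is only one of several column compression schemes (delta and dictionary encoding are others); which ones Linkscape uses is not stated here.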
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs), and then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text compressed data - so it is a LOT of information). Making this fast and scalable is certainly a challenge.
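As a rough illustration of the "link graph, then metrics" step, here is a toy Python sketch. It builds a graph from (source, target) pairs and runs a basic PageRank-style power iteration. PageRank is used purely as a stand-in; the actual Linkscape metrics and their formulas are not described in this thread, and the sample links, damping factor, and iteration count are all assumptions.

```python
from collections import defaultdict

# Hypothetical crawl output: (source, target) link pairs.
links = [
    ("a.com", "b.com"),
    ("a.com", "c.com"),
    ("b.com", "c.com"),
    ("c.com", "a.com"),
]


def build_graph(links):
    """Return {source: set(targets)}, including nodes with no outlinks."""
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)
        graph[target]  # touching the key registers the target as a node
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Basic power-iteration PageRank over an in-memory graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


graph = build_graph(links)
scores = pagerank(graph)
```

At tens of billions of URLs nothing like this fits in one machine's memory, which is why the processing step is spread over many instances.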
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views; a lot of the overviews and aggregates are served with the web app db.
Most of the code here is Ruby, except for the crawling and issue processing, which is C++. All of it runs on Linux.
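For the duplicate-page issue and the deltas mentioned above, a minimal Python sketch might look like the following. It assumes exact-match detection by hashing normalized page text, which is a simplification chosen for illustration; the real issue processing is C++ and its method is not described here. The function names and sample pages are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def find_duplicates(pages):
    """Group URLs whose normalized body hashes to the same digest."""
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


def issue_delta(previous_count, current_count):
    """Change in an issue count between two crawls of the same site."""
    return current_count - previous_count


# Hypothetical crawl of one site: URL -> page text.
pages = {
    "/home": "Welcome to our shop",
    "/index.html": "welcome   to our SHOP",
    "/about": "About us",
}
duplicates = find_duplicates(pages)
```

Precomputing results like `duplicates` and the per-crawl deltas once, after each crawl, is what lets the web app serve the overview numbers without recomputing them on every page view.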
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.