Ways to analyze a 1M rows dataset of search queries
-
Hi,
I have this large dataset, about 1 million search queries with visits, bounce rate and a few other metrics. I'm trying to explore this data to find keyword "buckets" (such as include product name, location name, transactional objective, informational, etc.), as well as explore the density of certain keywords (keywords as in instances of a single word amongst all queries)
My idea was to use Excel and a macro to split all queries in separate words (also clearing punctuation and uppercase/lowercase), then storing this word in a new worksheet, adding to another column the visit counts from the row where the word was extracted (as to give a sense of weight). Before adding the word to the new worksheet, the script will look if the word already existed, if so it would just add the current value of visits to the existing visit counts etc.
In the end it will create sort of a "dictionary" of all the keywords in all search queries ranked by weight (= visits from search query including this keyword)
This would help me get started I believe, because I can't segment and analyze 1M raw search queries...
My issue is: this VBA has been running on my (fast) PC for the last 24hr and it doesn't seem to get to an end. Obviously excel+VBA is not the best way to do text mining and manipulation in such a large dataset (although it's just a 30mb file)
What would you do if you had this dataset and would like to mine the text/semantic as I am doing? Any idea of tools? process?
I'm considering dumping this data into a MySQL db and doing the processing through PHP (the only backend language I'm versed in), and getting the "summified" data stored into another table, which I'll then be able to export to a Excel for analysis. But I'm afraid that I'll be facing memory limit issues and such...
In the meantime, I'm definitely interested into knowing what you guys would do if you had this data and wanted to simply start exploring its constituencies
Thanks!
-
Yeah, Access can process any number of rows. It's Microsoft's database program. You can upload data, and then create queries. They have a design view where you can construct queries in a WYSIWYG fashion, or if you want, you can write your own SQL.
-
Thanks a lot John!
I'm going to try this out tonight!
So, I assume, Access won't have the same processing limitations with 1 million rows, will it?
Once I'll be done with the "discovery phase" I'm going through with this keyword list, I'll definitely use Advanced filters (in Excel) as you recommend to understand keyword groups in details
-
I had a similar problem going through my search query reports. If you're already familiar with VB you could do this with a Microsoft Access database rather than setting up a MySQL one w/PHP. I've been working on creating an Access database that I can import my data into, and have it spit out all sorts of useful info (for example negative keywords and placements), but it's only in its early stages right now.
If you just want to see it for a few terms and don't mind doing it one at a time, in the past I've filtered data like this in Excel without VB using advanced filters. I found that using advanced filters rather than VB sped up the process quite a bit; I'd imagine because it's an innate Excel function. Using 4 filters you can match whole words in the queries. For example, to find queries with "blah", you'd set a filter for "blah", "* blah", "blah " and " blah *". Then you can use the Subtotal command to do calculations over the visible rows and calculate the data.
More about advanced filters: http://office.microsoft.com/en-us/excel-help/filter-by-using-advanced-criteria-HP005200178.aspx
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How do you analyze a traffic drop with no historic Google Analytics data?
A client of mine has a large website with multiple sections (shop, forums, articles, etc.) that apparently had a significant reduction in rankings, traffic, and sales in the past. However, historic Google Analytics data is not available for the site, and I'm having troubles identifying anything concrete about the traffic drop, such as when it happened, what pages/sections it happened to, etc. The shop traffic drives most of the revenue, but it's a small number compared to the forums traffic, so it's hard to pick anything out of top-line trends like SEMrush offers. What tools or strategies might help in this situation?
Reporting & Analytics | | AdamThompson0 -
Why do WMT's Search Analytics & GA's Real Time Keywords Differ?
I took a screen shot of the keywords from Google Analytics Real Time Keywords for Google / Organic. Then I waited a week and filtered Google Webmaster Tools' Search Analytics for queries with those keywords. None of them showed up. Why not? What I missing? Why do some keywords show up in Google Analytics RT for Google / Organic while most others are in Not Provided? I see WMT's Search Analytics says that it doesn't always show queries that "are made a very small number of times or those that contain personal or sensitive information." But most of my queries are like that. Anybody know any patterns / criteria that leads WMT to show some but not others? Thanks
Reporting & Analytics | | GilReich0 -
Is there an efficient way to block/filter referral spam in Google Analytics for a large network of websites?
Hello, everyone - I'm looking for guidance on how to block or filter referral spam in Google Analytics. But I'm needing to block for an entire network of Wordpress websites. We have two networks which total over 2,500 websites. We are currently blocking sites we find out about via htaccess. This works, but only after we see we are getting hit with the spam. Updating 2,500+ Google Analytics accounts with filtering is not an ideal option due to the time factor and the fact that new bots coming out almost daily. We can continue the htaccess method, but does anyone have any other ideas for blocking referral spam for a large network of sites? These are the other ideas we have. 1. Blocking all traffic from Russia and China based up subnets. We know many will still get through, but it should block 50% of it, we hope.
Reporting & Analytics | | copyjack
2. Moving sites to Google Tag manager. This is a huge tasks but we have seen that sites using Tag Manager are not effected, at least for now. Other ideas are appreciated!0 -
Is there an automated way to determine which pages of your website are getting 0 traffic?
I'm doing a content audit on my company website and want to identify pages with zero traffic. I can use GA for low traffic, but not zero traffic. I can do this manually, but it would take a long time. Are there any tools to help me determine these pages?
Reporting & Analytics | | Ksink0 -
Linked my adwords account to GA and vice versa and still paid search is getting recorded into organic traffic??
Hi Mozzers, I have linked properly my adwords account to GA and vice versa and somehow I can see 3/4 of this paid traffic recorded to organic search. The most confusing part is that I can see 1/4 of the paid traffic under the "paid" metric. At this point I don't know really what should I do? Thank you guys in advance!
Reporting & Analytics | | Ideas-Money-Art0 -
How Do Queries And Impressions Relate?
For one of our keywords, i have 2,500 impressions this past month, but there were only 1,300 queries according to Google's keyword planner. How can I have more impressions than queries? If anything, I thought it would be the other way around. If someone could flush this out for me, I'd be incredibly grateful. Thanks, Ruben
Reporting & Analytics | | KempRugeLawGroup0 -
Filter out IP address of Site Search analytics
Hi Mozzers, I have a filter that excludes all internal traffic from my sites. But this does not seem to work on site search > Search Terms See here:- http://productforums.google.com/forum/#!searchin/analytics/filter$20site$20search/analytics/pO18L31hEO4/tJ3lKVNT3YYJ Any ideas? Or is it a bug, etc Thanks S
Reporting & Analytics | | Metropolis0 -
Does Google Analytics parse visits from search apps?
Does anyone know if Google Analytics reports visits to your website differently from individual search apps like Google and Bing? Or do they just treat them the same as any other keyword visit from Google or Bing search engine? I suppose the end result is probably the same as in the Google app you're using Google so it would just be a different access point versus a new tool. I'm just curious if there is a way to see how many Mobile visits are coming from the apps vs the browser. For me personally I have the Google and Bing iOS apps installed but rarely use them, opting for the Safari search bar 99% of the time.
Reporting & Analytics | | nsauser0