Ways to analyze a 1M rows dataset of search queries
-
Hi,
I have this large dataset, about 1 million search queries with visits, bounce rate and a few other metrics. I'm trying to explore this data to find keyword "buckets" (such as include product name, location name, transactional objective, informational, etc.), as well as explore the density of certain keywords (keywords as in instances of a single word amongst all queries)
My idea was to use Excel and a macro to split all queries in separate words (also clearing punctuation and uppercase/lowercase), then storing this word in a new worksheet, adding to another column the visit counts from the row where the word was extracted (as to give a sense of weight). Before adding the word to the new worksheet, the script will look if the word already existed, if so it would just add the current value of visits to the existing visit counts etc.
In the end it will create sort of a "dictionary" of all the keywords in all search queries ranked by weight (= visits from search query including this keyword)
This would help me get started I believe, because I can't segment and analyze 1M raw search queries...
My issue is: this VBA has been running on my (fast) PC for the last 24hr and it doesn't seem to get to an end. Obviously excel+VBA is not the best way to do text mining and manipulation in such a large dataset (although it's just a 30mb file)
What would you do if you had this dataset and would like to mine the text/semantic as I am doing? Any idea of tools? process?
I'm considering dumping this data into a MySQL db and doing the processing through PHP (the only backend language I'm versed in), and getting the "summified" data stored into another table, which I'll then be able to export to a Excel for analysis. But I'm afraid that I'll be facing memory limit issues and such...
In the meantime, I'm definitely interested into knowing what you guys would do if you had this data and wanted to simply start exploring its constituencies
Thanks!
-
Yeah, Access can process any number of rows. It's Microsoft's database program. You can upload data, and then create queries. They have a design view where you can construct queries in a WYSIWYG fashion, or if you want, you can write your own SQL.
-
Thanks a lot John!
I'm going to try this out tonight!
So, I assume, Access won't have the same processing limitations with 1 million rows, will it?
Once I'll be done with the "discovery phase" I'm going through with this keyword list, I'll definitely use Advanced filters (in Excel) as you recommend to understand keyword groups in details
-
I had a similar problem going through my search query reports. If you're already familiar with VB you could do this with a Microsoft Access database rather than setting up a MySQL one w/PHP. I've been working on creating an Access database that I can import my data into, and have it spit out all sorts of useful info (for example negative keywords and placements), but it's only in its early stages right now.
If you just want to see it for a few terms and don't mind doing it one at a time, in the past I've filtered data like this in Excel without VB using advanced filters. I found that using advanced filters rather than VB sped up the process quite a bit; I'd imagine because it's an innate Excel function. Using 4 filters you can match whole words in the queries. For example, to find queries with "blah", you'd set a filter for "blah", "* blah", "blah " and " blah *". Then you can use the Subtotal command to do calculations over the visible rows and calculate the data.
More about advanced filters: http://office.microsoft.com/en-us/excel-help/filter-by-using-advanced-criteria-HP005200178.aspx
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Organic search traffic down 60% since 8/1/18\. What now?
I have a small health & fitness blog, and my Google search traffic suddenly dropped 60% around August 1 (I've attempted to link an image). My rank has dropped for 86 keywords. I have no manual penalty, so I'm guessing I was affected by the algorithm change. My technical skills are VERY limited. I've tried to find answers on my own, but every time I try to "fix" something, I only seem to make it worse. I do seem to have some structural/performance issues with my site (e.g., lots of 404 errors from uninstalled plugins and unwanted permalinks). I asked my server for assistance (I used managed Wordpress hosting), and they said they couldn't help. As you can imagine, this is quite devastating, and I have no clue where to go from here. I don't know if I'm allowed to link to my site here, but it's mommyrunsit dot com. Any assistance is greatly appreciated. Thanks. Sharon 0lPu4wY
Reporting & Analytics | | RoniFaida1 -
Search my keyword on google
When I search my keyword on google I can see my website, but when I connect to a VPN and again search it, I'm not in the search result, what happens on my website?
Reporting & Analytics | | Pintapin0 -
Did Analytics change the way to handle Google images searches on Dec 12?
Dear all, One of the sites I'm monitoring receives a lot of traffic from image searches or images that appear in universal search results. On Dec 12th, 2015, the bounce rate for these sessions went from around 30% the day before to around 87%. See screen shot below. Did anybody notice similar bounces in the bounce rate? Did Google change something in the way that image search is handled? Looking forward to your ideas! large?v=mpbl-1&px=999
Reporting & Analytics | | AABMarketing_Frank0 -
Differences in site search revenue in GA
I just put in a piece of software to replace a really bad built in site search engine on my 3dcart website. Now I am trying to measure the change, but I am having some issues. When I check the ecom data in the conversions section of GA with the built in segment Performed Site Search, I get promising results. Approximately 5% revenue increase over LY. But if we jump to behavior, site search, usage, and then check the visits with site search, I get a decrease by 4%. And the actual revenue is off, by like double (150k compared to 80k) Anyone have any idea why I am getting these results? The site search function is set up. Tracking is enabled, query parameter is keyword and search url is /search.asp?keyword=
Reporting & Analytics | | ShockoeCommerce0 -
Is there a way to apply the same google analytics filter to multiple properties?
I manage WordPress multiple sites for my clients. Some of these clients are SEO customers. One issue I have with the analytics reports is the occurrence of spam and ghost spam. I know how to create filters to block however there are always more and more. Is there away to export the filter and import it to all the other properties i manage? Or do they just have to be done manually every time i need to block spam?
Reporting & Analytics | | donsilvernail1 -
"Local Search Volume" For what period (is it a month?)
When I look at the heading "Local Search Volume" or, "Global Search Volume... Are these figures for 1 month (that's what I'm assuming) Thanks
Reporting & Analytics | | bricktech0 -
Viewing image search data in Analytics?
How do you view image search data in Google analytics? Do you need to apply tracking or can you drill down into the default data? Thanks.
Reporting & Analytics | | Bondara0 -
Search within search? Weird google URLs
Good morning afternoon, how are you guys doing today? I'm experiencing a few Panda issues I'm trying to fix, and I was hoping I could get some help here about one of my problems. I used Google analytics to extract pages people land on after a Google search. I'm trying to identify thin pages that potentially harm my website as a whole. It turns out I have a bunch of pages in the likes of the following: /search?cd=15&hl=en&ct=clnk&gl=uk&source=www.google .co.uk, and so on for a bunch of countries (.fi, .com, .sg, .pk, and so on, maybe 50 of them) My question is: what are those pages? their stats are awful, usually 1 visitor, 100% bounce rate, and 0 links. Do you think they can explain my dramatic drop in traffic following Panda? If so, what should I do with them? NOINDEX? Deletion? What would you suggest? I also have a lot of links in the likes of the following: /google-search?cx=partner-pub-6553421918056260:armz8yts3ql&cof=FORID:10&ie=ISO-8859-1&sa=Search&siteurl=www.mysite.com/content/article They lead to custom search pages. What should I do with them? Almost two weeks ago, Dr. Pete posted an article untitled Fat Panda and Thin Content in which he deals with "search within search" and how they might be targeted by Panda. Do you think this is the issue I'm facing? Any suggestion/help would be much appreciated! Thanks a lot and have a great day 🙂
Reporting & Analytics | | Ericc220