Ways to analyze a 1M rows dataset of search queries
-
Hi,
I have this large dataset, about 1 million search queries with visits, bounce rate and a few other metrics. I'm trying to explore this data to find keyword "buckets" (such as include product name, location name, transactional objective, informational, etc.), as well as explore the density of certain keywords (keywords as in instances of a single word amongst all queries)
My idea was to use Excel and a macro to split all queries in separate words (also clearing punctuation and uppercase/lowercase), then storing this word in a new worksheet, adding to another column the visit counts from the row where the word was extracted (as to give a sense of weight). Before adding the word to the new worksheet, the script will look if the word already existed, if so it would just add the current value of visits to the existing visit counts etc.
In the end it will create sort of a "dictionary" of all the keywords in all search queries ranked by weight (= visits from search query including this keyword)
This would help me get started I believe, because I can't segment and analyze 1M raw search queries...
My issue is: this VBA has been running on my (fast) PC for the last 24hr and it doesn't seem to get to an end. Obviously excel+VBA is not the best way to do text mining and manipulation in such a large dataset (although it's just a 30mb file)
What would you do if you had this dataset and would like to mine the text/semantic as I am doing? Any idea of tools? process?
I'm considering dumping this data into a MySQL db and doing the processing through PHP (the only backend language I'm versed in), and getting the "summified" data stored into another table, which I'll then be able to export to a Excel for analysis. But I'm afraid that I'll be facing memory limit issues and such...
In the meantime, I'm definitely interested into knowing what you guys would do if you had this data and wanted to simply start exploring its constituencies
Thanks!
-
Yeah, Access can process any number of rows. It's Microsoft's database program. You can upload data, and then create queries. They have a design view where you can construct queries in a WYSIWYG fashion, or if you want, you can write your own SQL.
-
Thanks a lot John!
I'm going to try this out tonight!
So, I assume, Access won't have the same processing limitations with 1 million rows, will it?
Once I'll be done with the "discovery phase" I'm going through with this keyword list, I'll definitely use Advanced filters (in Excel) as you recommend to understand keyword groups in details
-
I had a similar problem going through my search query reports. If you're already familiar with VB you could do this with a Microsoft Access database rather than setting up a MySQL one w/PHP. I've been working on creating an Access database that I can import my data into, and have it spit out all sorts of useful info (for example negative keywords and placements), but it's only in its early stages right now.
If you just want to see it for a few terms and don't mind doing it one at a time, in the past I've filtered data like this in Excel without VB using advanced filters. I found that using advanced filters rather than VB sped up the process quite a bit; I'd imagine because it's an innate Excel function. Using 4 filters you can match whole words in the queries. For example, to find queries with "blah", you'd set a filter for "blah", "* blah", "blah " and " blah *". Then you can use the Subtotal command to do calculations over the visible rows and calculate the data.
More about advanced filters: http://office.microsoft.com/en-us/excel-help/filter-by-using-advanced-criteria-HP005200178.aspx
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Google Search Console - Click Count Inconsistency
In Google's search console I see a discrepancy between click counts. At top I see this here and then beneath I see these kinds of numbers for click counts here. So the top click count says 252 and the bottom section appears to only shows less than 40. Probably a simple explanation here that I'm just not seeing. Thanks!
Reporting & Analytics | | a_toohill0 -
Blog posts not appearing in Search
http://www.themorrisagency.co.uk/blog works well to index my niche terms. Looked at webmaster tools today and since today it has put my indexed pages from 900+ to 11 ??? I have recently added a new site format via a web designer and was all working fine. Now none of my posts are appearing.I have resubmitted site map for 993 pages Is there anything I can do to stop/ prevent this from happening again? and what do you think was the cause? Thanks everyone! Daniel
Reporting & Analytics | | Agentmorris0 -
What is the best way to eliminate this specific image low lying content?
The site in question is www.homeanddesign.com where we are working on recovering from some big traffic loss. I finally have gotten the sites articles properly meta titled and descriptioned now I'm working on removing low lying content. The way there CMS is built, images have their own page (every one that's clickable). So this leads to a lot of thin content that I think needs to be removed from the index. Here is an example: http://www.homeanddesign.com/photodisplay.asp?id=3633 I'm considering the best way to remove it from the index but not disturb how users enjoy the site. What are my options? Here is what I'm thinking: add Disallow: /photodisplay to the robots.txt file See if there is a way to make a lightbox instead of a whole new page for images. But this still leaves me with 100s of pages with just an image on there with backlinks, etc. Add noindex tag to the photodisplay pages
Reporting & Analytics | | williammarlow0 -
Viewing image search data in Analytics?
How do you view image search data in Google analytics? Do you need to apply tracking or can you drill down into the default data? Thanks.
Reporting & Analytics | | Bondara0 -
Google Analytics Tracking Code Queries
Hello, I have taken on a new client who has Google Analytics installed. The tracking code is set to 'single domain'. Recently they added a mobile site using a sub-domain (m.website.com) which means that Google Analytics is not picking up this traffic. I want to revise the account so that I have a master account (raw data) and then profiles for the mobile site, main domain (www.website.com) and one other for a sub-domain that they are using. I am aware that there is mobile specific tracking code however I thought it would be easier (re conversions/goals/eCommerce tracking) to not use this and by changing the account to 'multiple domains' we could also get data for another sub-domain that they are using . My questions are: Am I right to want to use individual profiles over web properties. If not please explain why. When installing the tracking code (where the profile number is changing) I believe that I need to add that code with the changing profile number to the sub-domain sections. So my question is a) is that correct, and b) if I use a profile number on a sub-domain section will the master account still gather the data for the main URL as well as all sub-domains. If I change the master account from using 'single domain' tracking code to 'multiple domain' tracking code will this affect historical data? Will I lose the data? When changing from 'single domain' tracking to 'multiple domain' tracking does this affect eCommerce tracking? Or do we only need to be adding the additional lines of tracking code that allow sub-domains to be tracked? The web developers are using asynchronous code however half is in the and the other half is at the bottom of the source code. Given that traffic is being reported in the Google Analytics account should I have any concerns that the code is split? I have done a lot of reading but seem to be going around in circles, so your help is much appreciated! Thanks,
Reporting & Analytics | | Unity
Dinny0 -
Google encryption of search results
Hello Is there any definitive information on whether the recent changes to the way Google encrypts search results for people logged into Google affects the traffic present via the Google Adwords Keyword tool? Plus, how does SEOmoz ensure/minimise the affect of personalisation/localisation etc. on the rankings data provided via the pro tool. Thanks in advance for your assistance. Kind regards Neil
Reporting & Analytics | | mccormackmorrison0 -
Google Secure Search Announcement: How do you think this will play out?
So, I read this post on the Google Blog and then this commentary about it. Wondering what some of you veterans think about how this will impact results. I always remain logged out for queries and use several different software tools to check when I run an analysis. Since social influences are becoming a bigger factor, do you think this is going to create some issues in data? What are your thoughts?
Reporting & Analytics | | TheARKlady0 -
Site: Query Question
Hi All, Question around the site: query you can execute on Google for example. Now I know it has lots of inaccuracies, but I like to keep a high level sight of it over time. I was using it to also try and get a high level view of how many product pages were indexed vs. the total number of pages. What is interesting is when I do a site: query for say www.newark.com I get ~748,000 results returned. When I do a query for www.newark.com "/dp/" I get ~845,000 results returned. Either I am doing something stupid or these numbers are completely backwards? Any thoughts? Thanks, Ben
Reporting & Analytics | | BenRush0