Many of the projects that I’ve been working on over the last 6 to 9 months have evolved to include more and more data integration. Specifically companies are looking to import more Google Analytics data into other systems to do various types of analysis.
Data integration is nothing new, companies have been doing it for a long time. I’ve written in the past about integrating Google Analytics with CRM data.
Other analytics vendors have been supporting this functionality for a while. But I think as more organizations move to Google Analytics they’re looking to get the same functionality as their other tools. Hence this uptick in data integration.
So I figured this would be a good time for a high-level overview of the most common way to integrate Google Analytics with a data warehouse.
What is a Data Warehouse?
In reality a data warehouse is just a big database that pulls from lots of different sources. A businesses would build a data warehouse to do more complex types of analysis. Other types of data you might find in a data warehouse is customer data, purchase history information or demographic information.
When all of this data is put together you can do some interesting types of analysis. For example, you can do better analysis about online behavior and offline conversion. Or tie long-term purchase history to online behavior.
Another benefit of building a data warehouse is the ability to create very custom marketing segments. Adding information about conversion behavior and marketing activities can help augment historical sales data and create new marketing programs.
Gary Angel wrote a great post on some of the reason why you would want to build a data warehouse. Avinash Kaushik also wrote an interesting counter-point, why you might not want to build a data warehouse.
My opinion, this is very useful, but do it in an agile, flexible way that does not take years. This stuff changes too fast to rebuild.
Almost all data integration projects revolve around a primary key. A primary key is a unique number that exists in different data sets. Because this number exists in multiple data sets you can connect the data sets using the key.
To add Google Analytics data to a data warehouse you need to add some type of primary key to Google Analytics. In most of the work that I’ve done this key is a visitor ID. This anonymous identifier usually comes from some other system like a CRM.
When we add the customer ID to Google Analytics we store in a visitor scoped custom variable. This means that the ID is stored in a cookie on the visitor’s machine. We normally set the cookie when the visitor makes a purchase or logs into the site. Basically any time we can positively identify the visitor.
NOTE: The function
_ugc() is a generic function that get’s a cookie value. In this case it’s extracting part of the
Once the ID is in the custom variable we should start to see data in Google Analytics, specifically the Custom Variables report. Here’s the custom var:
And here are all the visitor IDs:
I know what you’re thinking, “You can’t store personally identifiable information in Google Analytics!” But this isn’t personally identifiable information. And besides, this is the same as the transaction ID stored in the ecommerce data.
Now that we’re done debating PII, we need to merge this data with other data. This is where things get complicated. The rest of the process is all programming. And all the programming is custom. The language you use, how you store the data, it all depends on your IT department and their platform of choice. The rest of this post is meant to be a high-level overview.
First, let’s extract the data from GA via the API.
This query will generate a list of all the unique IDs stored in the custom variable. The key dimension in this query is ga:customVarValue5, that’s the value of the Custom Variable in slot 5 which holds the IDs. I’m using the Data Feed Query Explorer to test these queries.
The basic query is something like this:
And the data coming back would look something like this:
So where do you put this query? You need to create some type of code that runs automatically. You might pull data from Google Analytics daily or weekly. It all depends on the volume of your data and your specific needs.
Once I have a list of IDs I can create some looping logic, again part of an automated program, that iterates over the list of IDs and makes an API request for each ID number found above. This second query will get detailed information about each individual ID.
The secret to the second query is adding a custom segment that includes the ID number. Look for 1041533899 in the query below. That’s the ID number I pulled from the first API call above.
With this query I’m getting the data, the traffic source, keyword and all pages that the visitor viewed. I could also pull things like revenue, conversions… But I’m looking for data that I don’t already have via some other system. So I would stick to things that are happening on the site, like where the visitor came from and what they did on the site (watch a video, use a calculator, etc.)
Once this query comes back it’s time to do the data integration. You might want to manipulate the data before you store it in a database or you might choose to insert the raw data. Again, a lot of this depends on your specific environment.
The important thing is that you now have the data. Integrate.
Gotchas To Watch For
There are some definite gotchas when implementing this type of soltion. Most notably, you need to deal with data sampling applied via the API. From the Google Analytics API docs:
Google Analytics calculates certain combinations of dimensions and metrics on the fly. To return the data in a reasonable time, Google Analytics only processes a sample of the data. In these situations, the metric data returned in each entry includes a confidence interval.
So massively huge data sets might need more frequent queries. Or you might need to break the data down into multiple profiles and then query it via the API. Again, this is what makes this type of solution so custom.
Another limitation is that each query is limited to a maximum of 7 dimensions and 10 metrics. But you can get around that with some creative queries.
Thanks for sticking through this rather vague post. I know it lacks some detail, but this is a very specific technique that can differ from one client to another.