Many of the projects that I’ve been working on over the last 6 to 9 months have evolved to include more and more data integration. Specifically, companies are looking to import more Google Analytics data into other systems to do various types of analysis.
Data integration is nothing new; companies have been doing it for a long time. I’ve written in the past about integrating Google Analytics with CRM data.
Other analytics vendors have been supporting this functionality for a while. But I think as more organizations move to Google Analytics they’re looking to get the same functionality as their other tools. Hence this uptick in data integration.
So I figured this would be a good time for a high-level overview of the most common way to integrate Google Analytics with a data warehouse.
What is a Data Warehouse?
In reality, a data warehouse is just a big database that pulls from lots of different sources. A business builds a data warehouse to do more complex types of analysis. Other types of data you might find in a data warehouse are customer data, purchase history, and demographic information.
When all of this data is put together you can do some interesting types of analysis. For example, you can better connect online behavior to offline conversions, or tie long-term purchase history to online behavior.
Another benefit of building a data warehouse is the ability to create very custom marketing segments. Adding information about conversion behavior and marketing activities can help augment historical sales data and create new marketing programs.
Gary Angel wrote a great post on some of the reasons why you would want to build a data warehouse. Avinash Kaushik also wrote an interesting counterpoint on why you might not want to build one.
My opinion: this is very useful, but do it in an agile, flexible way that does not take years. This stuff changes too fast for that.
Almost all data integration projects revolve around a primary key. A primary key is a unique identifier that exists in different data sets. Because the same value exists in multiple data sets, you can connect them using the key.
To add Google Analytics data to a data warehouse you need to add some type of primary key to Google Analytics. In most of the work that I’ve done this key is a visitor ID. This anonymous identifier usually comes from some other system like a CRM.
When we add the customer ID to Google Analytics we store it in a visitor-scoped custom variable. This means that the ID is stored in a cookie on the visitor’s machine. We normally set the cookie when the visitor makes a purchase or logs into the site. Basically, any time we can positively identify the visitor.
For the sake of this example I’m going to use a little piece of JavaScript to create a unique ID for visitors. I’m actually going to extract a unique ID from the Google Analytics __utma cookie. Then I’m going to set the unique ID in a custom variable, specifically in Custom Variable slot 5.
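Something along these lines (a sketch; UA-XXXXXXXX-1 is a placeholder for your own account ID, and _uGC() is a small cookie-parsing helper, see the note below):

_gaq.push(['_setAccount', 'UA-XXXXXXXX-1']);

// Pull the __utma cookie and grab its second field, which is unique per visitor.
var utma = _uGC(document.cookie, '__utma=', ';');
if (utma != '-') {
  var visitorId = utma.split('.')[1];
  // Store the ID in a visitor-scoped custom variable, slot 5.
  _gaq.push(['_setCustomVar', 5, 'Visitor ID', visitorId, 1]);
}
_gaq.push(['_trackPageview']);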
NOTE: The function _uGC() is a generic function that gets a cookie value. In this case it’s extracting part of the __utma cookie.
Once the ID is in the custom variable we should start to see data in Google Analytics, specifically the Custom Variables report. Here’s the custom var:
And here are all the visitor IDs:
I know what you’re thinking, “You can’t store personally identifiable information in Google Analytics!” But this isn’t personally identifiable information. And besides, this is the same as the transaction ID stored in the ecommerce data.
Now that we’re done debating PII, we need to merge this data with other data. This is where things get complicated. The rest of the process is all programming. And all the programming is custom. The language you use, how you store the data, it all depends on your IT department and their platform of choice. The rest of this post is meant to be a high-level overview.
First, let’s extract the data from GA via the API.
This query will generate a list of all the unique IDs stored in the custom variable. The key dimension in this query is ga:customVarValue5; that’s the value of the custom variable in slot 5, which holds the IDs. I’m using the Data Feed Query Explorer to test these queries.
The basic query is something like this:
https://www.google.com/analytics/feeds/data?
ids=XXXXXXXXX&
dimensions=ga:customVarValue5&
metrics=ga:visits&
start-date=2011-04-20&
end-date=2011-05-04&
max-results=50
And the data coming back would look something like this:
So where do you put this query? You need to create some type of code that runs automatically. You might pull data from Google Analytics daily or weekly. It all depends on the volume of your data and your specific needs.
Once I have a list of IDs I can create some looping logic, again part of an automated program, that iterates over the list of IDs and makes an API request for each ID number found above. This second query will get detailed information about each individual ID (there’s a rough sketch of this loop at the end of this section).
The secret to the second query is adding a custom segment that includes the ID number. Look for 1041533899 in the query below. That’s the ID number I pulled from the first API call above.
https://www.google.com/analytics/feeds/data?
ids=XXXXXXXXXXX&
dimensions=ga:customVarValue5,ga:date,ga:source,ga:medium,ga:keyword,ga:pagePath&
metrics=ga:visits&
segment=dynamic::ga:customVarValue5==1041533899&
start-date=2011-04-20&
end-date=2011-05-04&
max-results=50
With this query I’m getting the date, the traffic source, keyword and all pages that the visitor viewed. I could also pull things like revenue, conversions… But I’m looking for data that I don’t already have via some other system. So I would stick to things that are happening on the site, like where the visitor came from and what they did on the site (watch a video, use a calculator, etc.)
Once this query comes back it’s time to do the data integration. You might want to manipulate the data before you store it in a database or you might choose to insert the raw data. Again, a lot of this depends on your specific environment.
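To make the moving parts a bit more concrete, here’s a rough sketch of that per-ID loop. It’s in JavaScript purely for illustration; in practice it would live in whatever language your warehouse jobs run in, and the actual fetching, parsing and database inserts are left out because they’re entirely platform-specific.

// Sketch only: build the per-visitor query URLs described above.
// The profile ID and dates are placeholders; API auth is omitted.
var visitorIds = ['1041533899'];  // ...plus the rest of the IDs from the first query

var base = 'https://www.google.com/analytics/feeds/data';
var params = {
  'ids': 'XXXXXXXXX',  // your profile ID, as in the queries above
  'dimensions': 'ga:customVarValue5,ga:date,ga:source,ga:medium,ga:keyword,ga:pagePath',
  'metrics': 'ga:visits',
  'start-date': '2011-04-20',
  'end-date': '2011-05-04',
  'max-results': '50'
};

visitorIds.forEach(function (id) {
  // The dynamic segment restricts the feed to a single visitor ID.
  params['segment'] = 'dynamic::ga:customVarValue5==' + id;

  var pairs = [];
  for (var key in params) {
    pairs.push(key + '=' + encodeURIComponent(params[key]));
  }
  var url = base + '?' + pairs.join('&');

  // Fetch the feed here (with your API credentials), parse the entries and
  // insert the rows into the warehouse; those steps depend on your platform.
  console.log(url);
});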
The important thing is that you now have the data. Integrate.
Gotchas To Watch For
There are some definite gotchas when implementing this type of solution. Most notably, you need to deal with data sampling applied via the API. From the Google Analytics API docs:
Google Analytics calculates certain combinations of dimensions and metrics on the fly. To return the data in a reasonable time, Google Analytics only processes a sample of the data. In these situations, the metric data returned in each entry includes a confidence interval.
So massively huge data sets might need more frequent queries. Or you might need to break the data down into multiple profiles and then query it via the API. Again, this is what makes this type of solution so custom.
Another limitation is that each query is limited to a maximum of 7 dimensions and 10 metrics. But you can get around that with some creative queries, for example by running multiple queries that share a common dimension (like the visitor ID and date) and joining the results after the fact.
Thanks for sticking through this rather vague post. I know it lacks some detail, but this is a very specific technique that can differ from one client to another.
Stupendous post Justin! :-)
I think this is a great summary of how to get up and running on merging Google Analytics and back-end data. Just wanted to refer readers to an interesting presentation by Adam Greco about merging Web Analytics and CRM (back when he was still at Salesforce): Integrating Web Analytics & CRM (see links to part II and III at the end of the post).
Hi Justin,
Thank you for an interesting morning reading. I think this is a good way to create e.g. statistical visitor segments and use those to improve the site. The EULA limits the usage of GA for e.g. recommendations, as does the sampling in any larger sets of data, but since we’re talking about statistics here, I see this as a perfectly valid use case.
The problem I see is the limit on API queries if you’re working with a larger set of data, since this method creates a lot of them (especially if you split the data or re-query to avoid overly sampled data). One thing to consider is to use a segment in the first step to make sure we’re getting visitor IDs that can be used for whatever need is in question (e.g. if engaged visitors are what you are looking for, those who only bounced are of no interest).
Hi Justin,
There’s another limitation that applies to sites with high daily traffic.
You can only store 50K different values each day (custom variables).
One other “gotcha” or, more aptly, “cautionary note” goes back to the PII limitation. Google does not *just* prohibit the collection of PII. The Terms of Service actually read: “You will not (and will not allow any third party to) use the Service to track or collect personally identifiable information of Internet users, nor will You (or will You allow any third party to) associate any data gathered from Your website(s) (or such third parties’ website(s)) with any personally identifying information from any source as part of Your use (or such third parties’ use) of the Service.” One of the driving reasons behind creating a data warehouse is to combine data from multiple systems. While the visitor ID is not directly PII, if the visitor ID is tied to, say, an eCommerce transaction with the transaction ID, and the transaction ID is used to tie to a person from a backend system…that’s a violation of the Terms of Service.
This is a slick idea — to throw user-level data into a custom variable and then use that as the atomic level of detail being pulled into a data warehouse. But, once you’ve done that, you may be enabling doing things with the data that are expressly forbidden, so you need to take a good, hard look at who has access to that data and ensure that they know what is/is not an acceptable use of the data.
Hi Justin
Thanks for your post. I think visitor-level data is an obvious route to take for any company that wants to set out to segment their visitors, integrate online-offline data and ultimately build predictive models for targeting. However, I’d be curious to know if anybody has succeeded in doing this with GA, because my experience is that your data doesn’t have to be massively huge for sampling to kick in. And while sampling of website- or campaign-level metrics may be acceptable, sampling of visitor-level data is definitely NOT.
Tools such as AnalyticsCanvas try to prevent sampling by splitting requests into more granular queries, but I’m not sure how successful this approach can be without Google explicitly supporting visitor-level data & analysis. My conclusion is that you may successfully integrate visitor-level data going forward if a day’s worth of data remains untouched by sampling. But any backward-looking analysis relying on batch export of months of data is a futile effort.
Would love to hear from you & others who have attempted this stuff.
Cheers
Jiri
@jiribrazda
@Jiri: Great point. Historical data will be very difficult to extract from Google Analytics due to sampling issues. I think this is more of a “forward looking” type of technique.
@Eduardo: I think @Andreas was talking about the table size. All GA tables are limited to 50k rows of data per day. This includes the custom variable tables, content tables, keyword tables, etc. This table row limit is in addition to any sampling that is applied during export via the API.
@Andres: Yes, you are correct. GA is limited to 50,000 unique values for a custom variable during a single day. Certainly a limit that a lot of larger organizations will hit.
@Sampsa: A great idea. Using segments to break down the data, thus reducing the number of API calls, is certainly possible. Again, it all depends on the amount of data you have and the information that you want to import to your data warehouse.
@Tim: Spot on, 100% correct. I know Section 7 very well :) But I believe that this type of integration is a completely valid method that lives in a grey area. I believe that the TOS is out of date with the current ecommerce tracking code. But in the end it’s up to every organization to evaluate their needs and how to meet them.
Thanks to everyone for the great comments.
@Andres,
In fact that 50k table limit is not applied to table storage; it’s enforced when the data is queried.
You can segment visits for IDs that start with ‘1’, and then the 50k limit is applied to that segment, so you might see user IDs that would be hidden if you queried all IDs.
Of course, heavy segmentation tends to cause more sampling. But the data is still there and may be accessible.
Great post, Justin. And great books, too :).
We’ve actually done something remotely similar, in order to calculate the “margin per view” (among other things) for an e-commerce website.
We took the product page visits from Google Analytics and the sales and margins from an external piece of software, and found out which were the products sold with almost every view, and among them, since margin is variable, which are the most profitable.
Now I’m wondering if visitor ID could not give you a pretty good conversion attribution, if you could get visitor ID, referrer, date and conversions.
@Calin: Your approach sounds almost identical. I assume you used Transaction ID as the primary key between GA and your external data source. Then segmented the data in Google Analytics based on the Transaction ID.
In theory yes, you could create a visitor-centric custom report with source/medium, date and conversions. Then filter that report using the new GA Custom Report filter, to get a list of when the visitor was on the site and what happened. You could also do the same thing via the API. LOTS of data though.
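Something roughly like this via the API (profile ID and dates are placeholders; swap in ga:goalCompletionsAll if goals are what counts as a conversion for you):
https://www.google.com/analytics/feeds/data?
ids=XXXXXXXXX&
dimensions=ga:customVarValue5,ga:date,ga:source,ga:medium&
metrics=ga:visits,ga:transactions&
start-date=2011-04-20&
end-date=2011-05-04&
max-results=50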
Thanks for the comment!
Wow! Talk about a fast answer. No, we actually used the product page URLs, which are unique per product on our shops, and are generated in our shop administration system. From there we got SKUs as well, so it was enough to get unique page visits for the product page URLs and the sales data from the “ERP” and that was it. X sold products with Y margin gives Z margin for product T which had U unique product page views.
Your approach could also have worked, of course, in theory, but I don’t think we push transaction IDs into the ERP, we just import orders and that’s it. But thanks for the hint, I actually think we should push this data into the ERP :).
@Calin: Thanks for sharing!
@Jiri: I also think that sampled data would be crappy at visitor level, since I already experience sampling issues when I break down per country! To me, GA is not made for such granularity…
@Justin: what do you mean by “forward looking type of technique”?
@Mickale: By forward looking I mean that, using this technique, you cannot link past data to the data warehouse. The reason is that this technique relies on a unique ID stored in a cookie, and that cookie did not exist in the past. The cookie will only exist from the day the code runs into the future.
Nice read!
One question – Is there a reason you’re only grabbing one portion of the visitor #? My utma cookie comes back with a whole set (265577865.1977082974.1305214903.1305214903.1305214903.1), so is one set more important or unique than another?
@Nik: I could have used the entire __utma cookie, but there is only one part of the cookie that is unique to each user. I just chose that part arbitrarily. Good question, thanks!
Hi Justin,
Nice article.
I was wondering why all of you mention a 50k url/keyword (table) limit. As far as I can tell the API documentation (http://code.google.com/apis/analytics/docs/concepts/gaConceptsSampling.html) describes a limit of 1.000.000 unique values for any given time period. So I would think it should be possible to do 100 requests of 10.000 result values per day to fetch 1.000.000 unique customVar values. The trick would be to pick the smallest time-frame possible to avoid limitations/sampling. Fetching more data at once would obviously decrease your chances of fetching complete data sets.
Can you be a bit more specific on this topic? Any experience?
I am planning to do a test run on one of our keyword reports today.
Thanks!
Gerron
Gerron: The 50k limit has nothing to do with the Data API. The actual data tables that store your daily Google Analytics data have a limit of 50,000 rows. Once you fill all 50,000 rows you can’t store any more data for that day.
For example, let’s say you have 120,000 different organic keywords for a given day. Google Analytics will only record the first 49,999 and then bucket the rest in a single item called (other).
Hope that clarifies things!
I just did a small test. I managed to get 90.000+ unique keywords back via the API for a single day from a single profile (dimension: keyword; metric: visits).
Hi Gerron,
I sort of have a fuzzy memory (or maybe I dreamed it) that Google had extended the keywords table to 100.000 unique values daily. But that is the only one that could collect more than 50.000 unique values daily.
Justin, do you know or remember anything about this?
Andrés
@Andres & @Gerron: As of 12 days ago the Keyword table was still limited to 50,000 daily unique values.
Justin –
Question regarding how and when GA writes cookies to the browser. If it is a “new visitor’s” first pageview on the site, based upon the above method the __utma cookie won’t exist, and if the visit is a bounce, this custom var won’t get populated in GA’s reports (indeed, I have seen this come up as (other) in my own reporting). Is there any way around this? In other words, is there a way to grab the __utma cookie values and send them to a customVar before trackPageview on a visitor’s first page of their first visit to the site? I’ve been getting frustrated by a lack of visitor level data from those first time bouncers and I’m hoping I’m just missing something here.
Thanks,
Yehoshua
@Yehoshua: Absolutely fabulous question. Custom variables are cached by the GA JS and sent with subsequent standard utm.gif requests. As you point out, if a visitor bounces then you lose any custom variables that may be set on that page. Right now the only way around that is by sending another piece of data back to Google, like an event.
So in my code, if it is the visitor’s FIRST visit to the site then you will not be able to record the custom variable in GA. You could use a different unique ID, either something generated at the server OR perhaps something generated in the JS. But grabbing the __utma cookie value will not work in the situation you describe. I would create the unique ID at the server and then push that value into a custom variable.
Hope that helps.
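For example, something roughly like this (a sketch; VISITOR_ID_FROM_SERVER stands for whatever value your server-side code writes into the page, and the account ID is a placeholder):

_gaq.push(['_setAccount', 'UA-XXXXXXXX-1']);

// The server generates the unique ID and writes it into the page before the
// tracking code runs, so it exists even on the first pageview of the first visit.
var visitorId = 'VISITOR_ID_FROM_SERVER';
_gaq.push(['_setCustomVar', 5, 'Visitor ID', visitorId, 1]);
_gaq.push(['_trackPageview']);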
Hi Justin, thanks for your post! PII aside, it’s very useful :-)
I have tried your code this way
_gaq.push(['_setAccount', 'UA-XXXXXXXXX-1']);
var a = _uGC(document.cookie, '__utma=', ';');
var id = a.split(".");
_gaq.push(['_setCustomVar', 4, 'Visitor ID', id[1], 1]);
_gaq.push(['_trackPageview']);
but I get an error message “_uGC is not defined”. Any idea how to fix this?
Thanks in advance!
Jaume
@Jaume: _uGC() is a small function that I use to get cookie values. Here is the code for the function:
//
// This is a function that I "borrowed" from the urchin.js file.
// It parses a string and returns a value. I used it to get
// data from the __utmz cookie
//
function _uGC(l, n, s) {
  if (!l || l == "" || !n || n == "" || !s || s == "") return "-";
  var i, i2, i3, c = "-";
  i = l.indexOf(n);
  i3 = n.indexOf("=") + 1;
  if (i > -1) {
    i2 = l.indexOf(s, i);
    if (i2 < 0) { i2 = l.length; }
    c = l.substring((i + i3), i2);
  }
  return c;
}
Hi Justin,
Awesome post. I use a similar method to this and it works great for tracking users across multiple visits and back to their initial keyword. I believe that now that non-interaction events are out, you could do it for all visitors (even those that bounce).
You could also create a virtual pageview to run if __utma doesn’t exist and filter it out on the profile side.
@Justin Thanks a lot :-)
Hi Justin,
Fantastic post. This article helps me a lot with development.
Really great article and extremely useful!!!! I was looking for it! Thanks!!!! @rociomiroES