We all know that data quality sucks! But there are a few vital steps that you can take to ensure that your Google Analytics data is as accurate as possible. Remember, accurate data makes for happy, and accurate, analysts.
Here are three simple tips that can help make your data more accurate.
1. Eliminate Duplicate Data
Many sites that I work on have duplicate data. The usual cause is mixed-case URLs. Google Analytics is case sensitive: it captures the data exactly as it appears in the location bar of the browser. So if a URL is of mixed case in the browser, it will be captured and displayed in mixed case within GA.
It’s very easy to have two URLs with the same functional meaning appear as two line items in GA because they differ in case. Here’s an example:
/worldseries/index.php?year=2007&keyword=lowell
/Worldseries/index.php?year=2007&keyword=Lowell
Both URLs probably point to the same content; they just appear different because of the case. We want to force both URLs to have the same case and thus make them appear as a single line item in GA. This can be done with a Lowercase or Uppercase filter. I like the lowercase filter, but you could just as easily use the uppercase filter. It’s a personal preference.
The filter below will force the Request URI to lowercase:
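To make the effect concrete, here is a rough Python sketch of what a lowercase filter does to the report. It is only an illustration of the idea, not GA's implementation; the function name `lowercase_filter` and the list of hits are made up for the example.

```python
from collections import Counter


def lowercase_filter(request_uri: str) -> str:
    """Mimic the effect of a lowercase filter on one field value."""
    return request_uri.lower()


# Hypothetical pageview hits as they might arrive before any filter runs.
hits = [
    "/worldseries/index.php?year=2007&keyword=lowell",
    "/Worldseries/index.php?year=2007&keyword=Lowell",
]

# Without the filter, each spelling becomes its own line item.
raw_report = Counter(hits)

# With the filter, both hits collapse into a single line item.
filtered_report = Counter(lowercase_filter(uri) for uri in hits)
```

Here `raw_report` holds two line items with one pageview each, while `filtered_report` holds one line item with two pageviews.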
I recommend adding a case changing filter to any data element (i.e. filter field) that could be mixed case. This includes:
- Request URI
- Campaign Name
- Campaign Term
- Campaign Medium
- Campaign Source
Another cause of duplicate data is multiple URLs that display the same content but have a different file extension. Here’s an example:
/champions/redsox.php
/champions/redsox.htm
These URLs may appear different (because of the file extension), but the web server might interpret them as the same file. Please note that not every web server behaves this way. It all depends on your web server. Check with your IT guru if your site has URLs with multiple file extensions.
You should merge duplicate URLs that differ only in file extension into a single line item. I find the best way to do this is with an advanced filter.
Some may think that a search and replace filter is the best way to remove these duplicates. But you would need to create a search and replace filter for each set of URLs that needs to be merged. An advanced filter, because it uses a regular expression, will change every URL that ends in ‘.htm’ to a ‘.php’ extension.
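The rewrite can be sketched in Python as a single regular expression substitution. The pattern below is my own assumption about what such a filter might look like, not the exact settings of a GA advanced filter; in GA the same idea would be expressed through the advanced filter's extract and output fields.

```python
import re

# Assumed pattern: capture everything up to ".htm" in the path, plus an
# optional query string, so only the file extension is rewritten.
HTM_URI = re.compile(r"^([^?]*)\.htm((?:\?.*)?)$")


def merge_extension(request_uri: str) -> str:
    """Rewrite a Request URI whose path ends in .htm so it ends in .php."""
    return HTM_URI.sub(r"\1.php\2", request_uri)


merged = merge_extension("/champions/redsox.htm")
untouched = merge_extension("/champions/redsox.php")
```

One pattern covers every URL on the site, which is why this approach scales better than one search and replace filter per pair of URLs.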
2. Remove Irrelevant Information
Extra information in the URL can cause big problems in Google Analytics. The reason is that GA will capture all of the data in a URL, which includes the query string parameters. Query string parameters that don’t have a functional meaning should be removed from the URL.
An easy way to identify these parameters is to collect data for a week and then analyze the top content report. Any query string parameter that does not provide insight into what the visitor sees or does should be eliminated.
To remove a query string parameter from GA simply add it to the ‘Exclude URL Query Parameters’ field in the profile settings:
Enter multiple parameters as a comma separated list.
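As a rough illustration of what this setting does to each URL, here is a small Python sketch. The function name `exclude_query_parameters` and the `sessionid` and `sid` parameters are hypothetical; GA does this work for you once the field is filled in.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def exclude_query_parameters(request_uri: str, excluded: str) -> str:
    """Drop the named query parameters from a Request URI.

    `excluded` is a comma separated list, like the profile setting.
    """
    names = {name.strip() for name in excluded.split(",") if name.strip()}
    parts = urlsplit(request_uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # urlencode re-encodes the remaining parameters in their original order.
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = exclude_query_parameters(
    "/worldseries/index.php?year=2007&sessionid=abc123&keyword=lowell",
    "sessionid, sid",
)
```

In this example `cleaned` keeps `year` and `keyword`, the two parameters that describe what the visitor saw, and drops the session identifier.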
Be aware that once you remove a query string parameter from GA it is completely eliminated from the system. So any goals, funnels or other filters that use that parameter will no longer work.
Also remember that you should remove any query string parameters that contain personally identifiable information. It is against the GA terms of service to collect PII.
3. Identify Your Segment
I could have easily named this tip ‘exclude internal data’ but I wanted to change the way we all think about profiles and the data that’s in them. I believe we should think of profile data in terms of the segment we want to analyze, not who we want to exclude. I know these statements are very close in meaning, but there is a slight difference. Segmentation is so important to analysis. I believe that every time we create a profile we should consider what segment of data it will contain.
I can think of a few segments of data that I would like to analyze:
- CPC traffic
- New visitors
- Return visitors
- European visitors
- Traffic from a specific marketing campaign
- Non-employee traffic
- Traffic generated by my call center
All of the above segments can be created as different profiles using include filters. Each will provide some insight into that segment. Don’t get me wrong, you’ll probably want to exclude internal traffic from 99% of your profiles. But try to think in broader terms and focus on the segment that you want to analyze.
Creating a profile based on a particular segment of traffic is pretty easy. The first thing you want to do is identify what segment of traffic you want to include in your profile. Then create a filter based on the filter field that represents that segment.
Let’s say I want to see all traffic generated from visitors performing some type of external search on my name. I could apply the following include filter to a profile:
This filter can easily be modified to include a specific marketing campaign (using the Campaign Name field), a specific country (using the Visitor Country field) or any other segment of data so long as it is represented by one of the filter fields. Please note that this will work even if you have AdWords auto-tagging turned on, even though you haven’t done any heavy lifting to define the Campaign Term.
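Conceptually, an include filter keeps only the hits whose chosen field matches a pattern. The Python sketch below shows that idea under some loud assumptions: the hit dictionaries, the field keys `campaign_term` and `visitor_country`, and the pattern `jane|doe` are all invented for the example and are not the settings of the filter described above.

```python
import re


def include_filter(hits, field, pattern):
    """Keep only the hits whose `field` value matches `pattern`."""
    regex = re.compile(pattern)  # case sensitive in this sketch
    return [hit for hit in hits if regex.search(hit.get(field, ""))]


# Hypothetical hits; the keys stand in for GA filter fields.
hits = [
    {"campaign_term": "jane doe analytics", "visitor_country": "United States"},
    {"campaign_term": "world series 2007", "visitor_country": "Canada"},
    {"campaign_term": "doe consulting", "visitor_country": "France"},
]

# Segment: visitors who searched on a (made up) name.
name_segment = include_filter(hits, "campaign_term", r"jane|doe")

# The same function builds a country segment by switching the field.
france_segment = include_filter(hits, "visitor_country", r"^France$")
```

Switching the field and pattern is all it takes to define a different segment, which is the point of thinking in terms of what to include.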
By the way, you will want to exclude internal traffic from many profiles. My favorite way to remove internal traffic from a profile is with an ‘Exclude all traffic from an IP address’ filter. Make sure you use anchors at the beginning and end of the regular expression.
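Here is a quick Python sketch of why the anchors matter. The IP addresses are made up for the example, and `is_excluded` is a hypothetical helper, but the regular expression behavior is the same idea: without anchors, the pattern matches anywhere inside the visitor's IP address.

```python
import re


def is_excluded(ip: str, pattern: str) -> bool:
    """Return True if the exclude pattern matches the visitor's IP."""
    return re.search(pattern, ip) is not None


# Assumed office IP: 12.123.123.12 (dots escaped so they match literally).
UNANCHORED = r"12\.123\.123\.12"
ANCHORED = r"^12\.123\.123\.12$"

visitors = [
    "12.123.123.12",   # the office
    "112.123.123.12",  # outside visitor, extra leading digit
    "12.123.123.123",  # outside visitor, extra trailing digit
]

excluded_unanchored = [ip for ip in visitors if is_excluded(ip, UNANCHORED)]
excluded_anchored = [ip for ip in visitors if is_excluded(ip, ANCHORED)]
```

The unanchored pattern throws away all three visitors, while the anchored pattern removes only the office address.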
Another good way to exclude internal traffic, especially if you don’t have a static IP address, is to use a little hack called Count Me Out. This hack uses the GA custom segment cookie to identify users.
So remember, yes, you need to exclude internal traffic, but try to take a broad view and think about segmentation when you filter your profiles.
Excellent post Justin. Hopefully this is already included in your most excellent e-book that everyone should buy ten copies of!
Thanks, really enjoyed the actionability of this post.
-Avinash.
Thanks Avinash!
Hi Justin,
Thanks for that post – good stuff! I had a question about the IP filter – I have been doing this for a while for a bunch of IPs corresponding to different offices around the country. Is it really necessary to have the caret ^ at the front of the match? I wonder if I have been screwing this up all this time.
Thanks,
W
Hi Justin,
I’ve recently set up the site search on the art site I work on for a client. As the keywords used in the site search began to be listed I noticed many of them either had a capital letter as a first letter or all the letters were capitals. So I was interested in using the lowercase filter to remove the capitals from the site search phrases. The regular keywords used to reach the site are all lowercase already. So I was excited to follow your lowercase filter example. However, it didn’t change the case of the site search keyword phrases. Is there another filter to make the change with?
Thanks for the help.
Rhoda Schueller
Hi Rhoda,
Unfortunately the on-site search processing happens before filters are applied to the data. That means that we cannot use filters to manipulate the reports. The only way to manipulate the data in the on-site search reports is programmatically. I’ve written a bit about it here.
Thanks for the comment and thanks for reading.
Justin
Ward,
Whether you need the caret depends on your IP address. If the first number in the IP address has fewer than three digits, like this:
12.123.123.123
then you ABSOLUTELY MUST use the caret. The reason is that if you do not use the caret, your IP regular expression will match ANY IP address that has a digit BEFORE the 12. That means the following IPs will match your reg ex:
112.123.123.123
212.123.123.123
Thanks for the question and thanks for reading the blog!
Justin
Hi Justin
Great post.
Another thing that could be added to the list is a filter that replaces %20 with whitespace or the other way around. Of course whitespace shouldn’t be in URLs, but when it is, the URLs show up differently depending on whether the visitor uses Firefox or Internet Explorer.
/Martin
Great idea Martin, thanks!