Love ’em or hate ’em, regular expressions are a part of Google Analytics. They provide a lot of flexibility, but at a price: small mistakes can become magnified and result in poor data quality.
I know there’s a lot of information out there about regular expressions, but I wanted to simplify the topic. In my opinion, here are the most important things to know.
Key Concept: How GA Regular Expressions Work
Let’s start by talking about how regular expressions work in Google Analytics. In general, we apply a regular expression to a piece of data. If the expression matches ANY part of the data then the expression will return TRUE. If the expression returns TRUE then some action will occur.
It doesn’t matter where you use the reg ex. If it’s part of an exclude filter, and the expression matches the data, then the data will be excluded. If it’s part of an include filter then the data will be included. If it’s part of a report filter then the report will only contain info that matches the reg ex. You get the idea.
[In this image think of the data as the square cube and the red work bench as the regular expression. If the cube is the same shape as the hole in the bench then an action happens; the cube falls through. Get it?]
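If it helps to see the idea in code, here is a minimal sketch of the include/exclude behavior described above. It uses Python's re module as a stand-in for whatever engine GA runs internally, and the function names and page paths are made up for illustration.

```python
import re

def include_filter(pattern, rows):
    """Keep only the rows where the pattern matches ANY part of the row."""
    return [row for row in rows if re.search(pattern, row)]

def exclude_filter(pattern, rows):
    """Drop the rows where the pattern matches ANY part of the row."""
    return [row for row in rows if not re.search(pattern, row)]

# Hypothetical page paths, not taken from a real GA profile.
pages = ["/index.php", "/blog/post-1", "/blog/post-2", "/contact"]

included = include_filter("blog", pages)
excluded = exclude_filter("blog", pages)

print(included)  # ['/blog/post-1', '/blog/post-2']
print(excluded)  # ['/index.php', '/contact']
```

Same expression, same data; the only thing that changes is the action taken when the expression returns TRUE.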
It’s really important to understand this because it simplifies the expressions we need to create. Let’s say I want to identify all the keywords in a set of data that contain the term excel. Here’s the full list:
word
excel
ms excel
excel 2003
linux
microsoft excel
excel 2007
excel makes pretty graphs
google
Rather than create some fancy regular expression, I can simply use: excel. After the expression is applied to the data we’ll have the following sub-set:
excel
ms excel
excel 2003
microsoft excel
excel 2007
excel makes pretty graphs
This simplifies the creation of your expression because you only need to match part of the data that you’re looking for. With that in mind, let’s move on to some tips that cover the most common uses of regular expressions.
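To try this outside of GA, the keyword example can be sketched in a few lines of Python. This assumes Python's re.search is a fair stand-in for GA's partial matching; the variable names are my own.

```python
import re

keywords = [
    "word", "excel", "ms excel", "excel 2003", "linux",
    "microsoft excel", "excel 2007", "excel makes pretty graphs", "google",
]

# re.search returns a match if the pattern occurs ANYWHERE in the keyword.
contains_excel = [kw for kw in keywords if re.search("excel", kw)]

for kw in contains_excel:
    print(kw)
```

Running it prints the same six keywords shown in the sub-set above.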
Tip #1: Use Anchors
Anchors are a way to specify if a regular expression should match the beginning of the data or the end of the data. Remember, reg ex works by matching ANY PART of a piece of data. Sometimes we’re looking for data that starts or ends a particular way and that’s why we need anchors. Let’s go back to the excel example.
word
excel
ms excel
excel 2003
linux
microsoft excel
excel 2007
excel makes pretty graphs
google
Suppose I only want to see the items that END with the word excel. Well, if I use the regular expression excel, I’m going to get all the items that contain the word excel no matter where it appears.
I need to create a reg ex that means, “ends with.” That’s done by placing a dollar sign, $, at the end of my reg ex. So the expression excel$ finds all of the keywords that END with excel.
It would match the following items from our list:
excel
ms excel
microsoft excel
To find all of the keywords that START with excel, put a caret, ^, at the beginning of the regular expression. The expression ^excel would match the following items from the list:
excel
excel 2003
excel 2007
excel makes pretty graphs
Now, let’s say I want just the keyword excel and nothing else. Use both anchors and the expression becomes ^excel$, which matches only the single item excel.
Anchors, pretty handy.
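Here is a quick sketch of all three anchored expressions run against the same list. Again, Python's re module is standing in for GA here, and the helper name matching is just something I made up for the example.

```python
import re

keywords = [
    "word", "excel", "ms excel", "excel 2003", "linux",
    "microsoft excel", "excel 2007", "excel makes pretty graphs", "google",
]

def matching(pattern):
    """Return the keywords that the pattern matches."""
    return [kw for kw in keywords if re.search(pattern, kw)]

ends_with = matching("excel$")     # must END with excel
starts_with = matching("^excel")   # must START with excel
exact = matching("^excel$")        # must be excel and nothing else

print(ends_with)    # ['excel', 'ms excel', 'microsoft excel']
print(starts_with)  # ['excel', 'excel 2003', 'excel 2007', 'excel makes pretty graphs']
print(exact)        # ['excel']
```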
Tip #2: Find This OR That
Many times in an analysis we’ll want to find multiple items from a set of data. For example, let’s say I want to find all the keywords that contain the name of an MS Office product. The complete list of keywords is:
word 2007
microsoft excel
outlook express
powerpoint
windows 95
mac OSX
linux
google rocks
Again, I’m only interested in the MS Office products, so I need to create an expression that includes the names of all the products. I want to find word OR excel OR outlook OR powerpoint. The pipe character, |, is used to represent OR logic. The following expression will return true if any of the items occur in the data:
word|excel|outlook|powerpoint
And here are the results:
word 2007
microsoft excel
outlook express
powerpoint
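The OR expression can be checked the same way. This is only a sketch, with Python's re module assumed to behave like GA for a simple pattern like this one.

```python
import re

keywords = [
    "word 2007", "microsoft excel", "outlook express", "powerpoint",
    "windows 95", "mac OSX", "linux", "google rocks",
]

# Each alternative separated by | is tried; any one of them is enough.
office_pattern = "word|excel|outlook|powerpoint"

office_keywords = [kw for kw in keywords if re.search(office_pattern, kw)]
print(office_keywords)
# ['word 2007', 'microsoft excel', 'outlook express', 'powerpoint']
```

One thing to keep in mind: because the expression matches ANY part of the data, word would also match a keyword like password. If that kind of thing shows up in your data, the anchors from Tip #1 can help tighten things up.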
Tip #3: If in Doubt, Escape it Out!
The dangerous thing about regular expressions is that we often don’t know what we don’t know. There are a lot of characters that have special meaning in reg ex. The plus sign, the question mark and the period are just a few. Inadvertently using a special character in an expression can lead to big trouble. There is an easy way to protect yourself: escaping.
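Escaping a character means that GA will interpret the character as a LITERAL character and not as a regular expression character. To escape a character, place a backslash in front of it. Here’s the great part. It doesn’t matter if you escape a non-special character. (One caveat: this applies to punctuation and symbols. Don’t put a backslash in front of letters or digits, because a backslash followed by a letter can mean something else entirely, such as \n for a new line. Escaping also behaves differently inside square brackets.) To me, escaping a character is like using a safety net. If you’re unsure if a particular symbol is a special character, escape it. It can’t hurt your expression.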
Time for an example. Let’s say we want to create a goal based on the following URL:
index.php?id=34
I need to turn the above into a regular expression. The question mark and period are special characters so they need to be escaped. But I’m not sure about the equal sign. I’d better escape it just to be safe. So the resulting reg ex is index\.php\?id\=34, even though, by the way, the equal sign is not a special character.
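To see why the escaping matters, here is a small sketch comparing the unescaped and escaped versions of that URL. It uses Python's re module as a stand-in for GA, so treat the details as an approximation; the test string indexZphid=34 is made up to show a false match.

```python
import re

url = "index.php?id=34"

unescaped = "index.php?id=34"        # . and ? keep their special meaning
escaped = r"index\.php\?id\=34"      # the hand-escaped version from above

# Unescaped: "." means any character and "p?" means an optional p,
# so the pattern no longer describes the real URL at all.
misses_real_url = re.search(unescaped, url) is None
matches_junk = re.search(unescaped, "indexZphid=34") is not None

# Escaped: every character is taken literally.
matches_real_url = re.search(escaped, url) is not None

# Python can do the escaping for you with re.escape.
auto_escaped_matches = re.search(re.escape(url), url) is not None

# Caveat: a backslash before a LETTER changes its meaning.
# \n is a new-line character, not the letter n.
backslash_n_matches_n = re.search(r"\n", "n") is not None

print(misses_real_url, matches_junk)   # True True
print(matches_real_url)                # True
print(auto_escaped_matches)            # True
print(backslash_n_matches_n)           # False
```

The surprising part is the first line of output: the unescaped expression fails on the very URL it was copied from, yet happily matches a string that isn't the URL.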
So there you have it. My two cents on regular expressions. These tips just scratch the surface of what you can do with reg ex. If you really want to learn about reg ex, check out my friend Robbin’s series on the subject.
At *least* $0.03 surely! ;-)
Great intro Justin to the weird and wonderful world of RE’s!
FWIW, there are cases where escaping characters can get you into trouble, eg inside []. But for an intro article? Perfectly fine!
Cheers!
– Steve
Thanks Steve! Always great to hear from you!
You’re absolutely right about escaping characters inside of brackets and I went back and forth trying to decide if I should make a note of that. In the end I wanted to keep this simple.
Have a great day,
Justin
Thanks for posting this, Justin. If there’s one thing I wish I knew more about, it’s regular expressions. I think it was my new year’s resolution last year, and here we are at 2008 already! Maybe 2008 will be the year. Good intro, it’s helpful for me.
Tyson
Thanks Tyson, glad you found the post useful. Good luck with your 2008 resolution!
Justin
Hey Justin – Great one. My coworkers are always trying to get me up to speed on reg ex and sometimes you just gotta take it one step at a time! Thanks for the post.
You just wanted to show off your son’s toys, didn’t you? And his little hands too.
“It doesn’t matter if you escape a non-special character.”
It’s worth noting that this isn’t a safe assumption for alphanumeric characters. For example ‘n’ is not equivalent to ‘\n’, as the latter is interpreted as a new-line character.
(correct me if this isn’t the case in GA)
Paul,
Thanks for pointing that out. You are absolutely correct.
Justin
Would welcome a regex 102 post. Short of taking a full scale course, regexes seem like the kind of thing best picked up incrementally. Feel free to take it another level deeper.
Fantastic Justin, I thought regular expressions were far more complex than this, turns out they are quite simple and thanks to your article I now feel I understand them.
I’m going to give these a shot :).
I am a scripting trainer. Part of my sessions include regular expressions. My students find that biterscripting is the easiest way to learn regular expressions. Biterscripting is free, it does not require anything else (compiler, etc.), it works on any windows version – so my students can download it on their home computers and start experimenting with regular expressions right away. Biterscripting provides many REs and stream editors (inserter, appender, enumerator, alterer, extractor by entire file, string, line, word, character) so that students can do more sophisticated things than just parsing. I believe a free download is available at http://www.biterscripting.com .
Sen
why did u stop?? I just started enjoying your class. Great work!