DS14 spent their first day of this short week with Nick ‘RegEx’ Jastrzebski and Andre, learning about regular expressions in Alteryx and Tableau Desktop & Prep.
If you haven’t heard of regular expressions, check out Bobby’s blog where he explains how and when to use them. In short, RegExes are patterns (sequences of characters) that a regular expression engine uses to match input text. They come in handy when you need to search or parse messy (but somewhat structured) text inputs. To get a better understanding of RegExes, this webpage is really useful – it explains well what RegEx is and let’s you practice. Regex101 and RegExR are also very useful for understanding and developing your regular expressions.
This short blog post aims to explain the four main uses of RegExes in Alteryx – Matching, Parsing, Replacing, and Tokenizing. You can choose which method you want to use by changing the output method in the configuration window of the RegEx tool. You can also use the functions REGEX_CountMatches, REGEX_Match, and REGEX_Replace in the Formula and filter Tools.
RegEx Match
This method of the RegEx tool returns a True or False statement (a Boolean value) depending on whether the searched text matches your regular expression. In the example below, I am looking for those names that start with an M (^M) and do not end with a space ([^\s]$). With this method, we could easily filter out rows we do not want.
RegEx Parse
The Parse method is probably the most common way to use regular expressions. Here, you can extract strings from your original text input into new columns – you can think of it as an extension to the Match method where you were only validating that the input contained your expression.
You will need to put everything you want to extract in parentheses , which will appear in a new column. It is also possible to extract multiple groups (and therefore create multiple columns) using one regex. Below, I extracted the dates and times from a messy string input using two marked groups. The first marked group captured the dates based on the pattern I have given Alteryx, whilst the second group captured the time. Extracting both date and time using one regular expression was only possible because they appeared in the same order in each row.
RegEx Replace
Using the Replace method you can very easily replace parts of a field (determined by your regular expression) with another string (replacement text), and you can also rearrange your string. Using Alteryx’s example below, we can use this method to get rid of parts of our address field and rearrange it. The expression (.*),(.*),(.*)\s(\d{5})(-\d{4})? divides the address field into 4 or 5 groups (the question mark after the last capturing group indicates that the last group is optional), and replaces the whole field with the values (represented by the dollar sign) in the 4th, 2nd, and 3rd group ($4: $2, $3).
RegEx Tokenize
This output method is similar to the Text to Columns tool but gives you much more flexibility: Alteryx searches the string input for each instances of your regex, placing each into a separate column or row. As seen in the example below, this can be very useful when you have one column with names, and you would like to easily split it into Titles, First Names, and Last Names – dropping nicknames. The expression I used looks for all substrings that start with a capital letter ([A-Z]) followed by one or more alphanumeric characters (\w+), meaning that it will not capture the quotation marks and will, therefore, skip the nicknames.
I think we all started this day somewhat confused about how exactly regular expressions were structured, but Nick and Andre showed us that once you understand the logic behind them, they are not too difficult, and are pretty useful (and you always have the cheatsheets!).