How and When to Use Regex in Alteryx

For many, having to use RegEx in their workflow is good reason to burn it all down. Hopefully this blog can make that process a little easier. However, this is not a blog about the rules of RegEx itself - there are plenty of blogs on that topic already (and even some about how to avoid it altogether). This is guide for when to use RegEx in Alteryx workflows, and how to pick which tool to use.

When Not to Use RegEx

There are many times when one of the other tools in the Parse tool palette will do just as well, and often make your workflow clearer to read. If a list is separated by a single consistent delimiter, Text To Columns is quicker to use that writing a RegEx expression (I know this is like writing ATM machine but what are you gonna do). If you have dates that in a weird string format, the DateTime tool is actually surprisingly flexible in terms of entering a custom matching format. See this blog for more information there. There are a ton of text processing options in the Data Cleansing tool, from replacing whitespace or specific digits to changing the case of strings. In addition, a lot of text processing that could seem like it requires RegEx can be handled by basic string functions. Contains() and FindString() are great for comparing strings to one another. Contains() returns true or false and FindString() (which is case sensitive!) returns an index number for the start of the match, or -1 if it is not found. Replace() and ReplaceChar() (both case sensitive) can do almost everything the replace function in the RegEx tool can, and Trim() is useful for getting rid of unwanted characters.

When to Use RegEx

There are some times when RegEx is essential, as in the following cases I've found.

Pulling dates out of strings - [PARSE] - when you have dates in the same field that come in different formats RegEx will help you extract the necessary information to create a consistent format (see challenge #4).

Pulling addresses out of strings - [PARSE] - address information is generally going to be in different formats and sometimes place names include spaces or dashes. RegEx will be much more flexible in this situation, and simpler than lengthy string formula rules (see challenge #54).

Breaking text into columns - [PARSE] - only in select circumstances, like if a delimiter is more than a single character long or variable in some way (see challenge #211).

Breaking text into rows - [TOKENIZE] - if you need to break the string into one row for every single character (see challenge #128).

Working with data in a matrix-like structure - [TOKENIZE] - in conjunction with various pivoting tools (see challenge #200).

Extracting the data from downloaded webpages or JSON - [PARSE, TOKENIZE] - text in this format is usually super complicated and you will have to rely on using HTML tags to grab the actual data in between (see challenge #40 or challenge #227).

Extracting data out of poorly formatted spreadsheets or reports - [PARSE, TOKENIZE] - there can be any number of reasons you will come across bad formatting, like outdated software or a piece of non-tabular data being forced into the shape of a spreadsheet (see challenge #22, challenge #28, challenge #316).

Cleaning strings based on context - [REPLACE, PARSE] - things like removing duplicate words, or adding prefixes or suffixes to certain words (see challenge #130).

Finding out if a string contains a substring - [REGEX_Match(), REGEX_CountMatches()] - only in certain circumstances, like if you want to compare multiple strings at a time, or find out how many times a substring occurs within a string (challenge #296 is an OK example of this).

Troubleshooting / Tips & Tricks

The Tokenize option of the RegEx tool is essentially the opposite of the Text To Columns tool. In Text To columns Alteryx asks you to specify a delimiter on which to split text, while in Tokenize you use RegEx to describe the text you want to extract, not the delimiter.

REGEX_Match() has to match the entire string, not just a part of it. This one has gotten me many times. Consider it the same as using the formula "abcdef" = \regex\. The Replace option of the RegEx tool operates the same way.

Many string functions are case-insensitive by default. You can turn them case-sensitive by adding a 0 as an optional third argument. These are Contains(), REGEX_CountMatches(), REGEX_Match(), and REGEX_Replace().

Author:

Adam Sultanov

View Profile