Regular Expressions (Regex) for complete beginners

When using Alteryx (or tools like excel or python), there are occasions where a human reader can see a clear pattern within strings in a column, but computers struggle to process it. Let's have a look at the example below.

Here is some data regarding customers in the US and their addresses. You may notice that some of the addresses have a 5 digit zip code, which others have a further 4 digits, which includes more local information. How would you find the rows with the additional 4 digits?

You couldn't simply find the last 4 characters in a string using a RIGHT function in the formula tool. After all, every row here ends with 4 numerical characters, regardless of it's zip code format. You could find the last 5 characters, and mark rows by whether those 5 characters begin with a dash. But this could find any row where the 5th-last character is a dash, which could include anomalous rows which don't fit the structure you've seen today.

In a small dataset, the solutions could be designed to fit the data you can see, with little consequence. However, Alteryx is designed as a workflow, to allow users to bring in new data with millions of rows of data. This raises the probability of anomalies, and requires those building workflows to be more precise in what you are looking for.

The Regex tool in Alteryx

Regex, or regular expressions, is built for exactly this sort of problem. It allows the user to confirm a more sophisticated pattern within a string, which can then be conditionally applied to a column within Alteryx. I won't go through every feature of Regex, but there are plenty of cheat sheets and blogs available which can show the range of possibilities.

So how could you solve this problem with Regex? Once you have attached the Regex tool to our data source, you can open the configuration window by clicking on the tool. The first option is which column to apply our expression to, which is correctly set as Address. The second box is where you find our regular expression.

The Regex configuration window in Alteryx

".*-\d{4}" is something you might expect to find after a cat has walked over your keyboard, but each character is purposeful.

The full stop(".") indicates that there could be any character at the start, with the asterisk ("*") suggesting that there could be zero or more of that character. To simplify, ".*" means that there could be any number of characters before the section of the string you care about. This makes sense, as the rest of an address can be any number of characters.

The dash ("-") indicates that, after an unknown number of characters, you expect to find a dash. This is followed by "\d", indicating that the next characters should be digits. "{4}" indicates that there should be exactly 4 digits. Together, this sequence "-\d{4}" suggests that there should be a dash, followed by 4 digits. Given that this is the last part of our expression, this will always be the end of the string.

All together, this mimics how a human would approach the problem. Focusing on the last part of the string, if you find a dash followed by 4 numbers then you can assume that you have the additional 4 digits of a zip code.

The final section of the configuration panel asks how you want to apply this information. In this case, you would want to know if each string matched this pattern, so you can use the "Match" method. This will generate a column containing a Boolean value, which says whether the row matches the value. You can name the match column, and use the tick-box to error if a value doesn't match.

The results pane

Hey presto! You've just used Regex for the first time, congrats! You have a column which correctly shows whether the additional 4 digits of the zip code are present, which can be applied to further analysis.

Regex is a complicated syntax, which is difficult for a beginner to read and write in. However, someone who is experienced in it's use can  apply multiple conditions to a string in a single tool and extract information rapidly. It's a tool that becomes more useful every time you use it, and there's so much more to explore within Regex.

Author:
Jack Arnaud
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
Subscribe
to our Newsletter
Get the lastest news about The Data School and application tips
Subscribe now
© 2024 The Information Lab