What is the RegEx tool?
The RegEx tool in Alteryx is very powerful once you are proficient at using it. RegEx is short for Regular Expression and is a formal language that is used not just in Alteryx but other contexts as well. It allows you to extract just those parts of a field (typically a string) that you are interested in. In many cases these actions can be performed with a Formula tool but the RegEx functions can give you a lot more flexbility and perform actions that the Formula tool can’t Getting your head around how to write commands is a learning process but it really is worth it. I am very much at the beginning of this process and thought I would share some beginner’s insights with those who are in a similar position.
A very helpful website is http://regexr.com/, where you can test your expressions on a text you can specify so you can watch what the RegEx expression is doing. You should also check out http://regexone.com/ if you are interested in learning RegEx as this website goes over the basics step by step and provides some exercises.
You can find a full explanation of how to configure the Alteryx RegEx tool at http://help.alteryx.com/9.5/RegEx.htm. You will need to select a field that your RegEx is addressing, and tell the tool what to do with the result (Output Method). The information that you are identifying with your expression can for example be replaced, or it can be sepated into new columns.
A use case example
A few weeks ago I was working with a source code from which I wanted to extract just a specific number string. I wrote about this in an earlier blog post.
As you can see, I wanted to extract just the numbers from the json identifier. The final expression I used was (<a href=”\d{3,4}.json). This might not be the best expression for this purpose (I am still learning) but it worked and I will briefly explain why I used each of the expressions:
- () : The brackets around the whole expression define a capturing group for extraction
- <a href=” : This is the specific string I was looking for in order to just get the first instance of the json number rather than both of the numbers. I could have replaced this with a different string to identify the second instead. I probably would have gotten away with using a shorter string as well (such as just the “, as the second json number is preceded with a >).
- \d. This will match any digit from 0 to 9, so this will get the different json numbers.
- {3,4}: This curly bracket after the \d is optional and specifies how many digits should be matched, in this case 3 to 4. If I had just included a 3 then only those json numbers with three digits would have been matched. If I change it to 5 then it will also account for possible future cases in which the number goes above 9999. Or I could write it as {3,}, which will match everything above 3. When writing out my expressions I try to be as narrow as possible.
- .json : Again, this determines the string that should follow the digits. This part isn’t reallly needed in order to extract this particular number but I actually needed this string to form the final URL so I brought it through as well.
Once I had extracted just this section of the source code that I was interested in I could now replace the first part of the string (<a href=”) with the base URL to form the URL that I needed to get the data I was looking for.