Using Regex to find key information from a free input text field

by Robert Headington

On Tuesday, we learnt what Regex is and how we can use this in Tableau, Tableau Prep and Alteryx. So when I saw that Regex could be used to help solve this week’s Preppin data challenge, I saw this as a great chance to practice using Regex and my Tableau Prep skills at the same time.

Regex is a logical coding language which you can use to pick out patterns or information in strings. This week’s Preppin data challenge was to find key information from a free input text field where the users can put anything they want, something that fits in well with Regex.

Thinking about the problem

But before we should think about how we should apply Regex, we need to know what the key information is in the data and how it is structured in the free text field. In our case there were a few things we needed:

  • The policy numbers
  • If the note was about a call or an email
  • If it was a complaint, a balance check or asking about a statement

Now we know what we want, we need to know how this information was structured:

  • The contact centre’s staff always put a # before they inputted the policy number
  • For a call, they used ‘call’ or call. Similarly for an email, it was ‘email’ or ‘emailed’.
  • Staff used both the terms complaint and complaining for complaints. While all notes on balance checks and statement requests used the terms ‘balance’ and ‘statement’ respectfully.

Some Regex code

So now we know what we want and how that information is structured, we can think about how we can use Regex to get these. There are some key things about Regex that you need to know to do this. In the Regex code (the ‘ ‘ are not part of the code, it’s what’s contained within these):

  • Periods ‘.’ can look for any character
  • A plus ‘+’ finds the previous character 1 or more times
  • The code ‘\d’ can look for any digit
  • Any string without any special characters look for that string. For example, ‘Call’ finds the letters Call. I say letters as it is not looking for a word, so ‘Call’ has the ability to find: Call, Called, Calling, I Called Bob today to borrow some soap, Calldsfsgernuvbc, gsgedggfCallvdfhfghf, etc. (some extra code is needed for some of these, covered below). Notice how each of these examples have a capital C. This is because Regex is case sensitive so ‘Call’ will not find call.
  • A line ‘|’ means OR. For example, ‘Call|Email’ will find Call, Called, Email, Emailed, etc.
  • Brackets ‘( )’ groups the terms inside the brackets

There is far more Regex coding than this so this doesn’t enable you to fully understand the power that regex can give a user

but this is all we need for this particular challenge.

Finding Policy numbers

So how can we use this to solve this challenge? We can get the policy numbers with the following code:

REGEXP_EXTRACT([Notes],’.+#(\d+)’)

In simple terms, this tells tableau to find all the digits after the first #. Let’s break this down:

  • REGEXP_EXTRACT is a Tableau function that lets you use Regex code to extract some information out of a text field
  • [Notes] is the field name of the string I want to extract information form.
  • You need to put your code inside ‘ ‘ in Tableau, this is not part of the code itself
  • .+# means that the code will look for any characters until it finds a #, then it will stop. In other words, it will find the first #
  • \d+ means that after the # it found, it will look for the digits after the # until there are no digits left. For example, #4645hg75 would find 4645 as it would stop when it sees a non-digit (in this case ‘h’). Similarly, #58424 866 would find #58424 as the space is not a digit.
  • (\d+) tells Tableau to extract whatever’s inside the brackets. In this case all the digits after the #.

Is it a call or an email?

To find if the note relates to a call or an email, we can use the following code:

REGEXP_EXTRACT([Notes],'(Call|Email|call|email)’)

This will basically find the first instance in the ‘notes’ field where one of the 4 options listed above is found and then returns this string as the result.

As explained above, this will find any of these words with a suffix (like ‘ed’). In this case, I didn’t need to use the lower-case version and could have got away with REGEXP_EXTRACT([Notes],'(Call|Email)’) as all the notes had capital letters for the corresponding words.

Was the note about a balance check, getting a statement or raising a complaint?

You can do a very similar thing to find out what the call was about:

IFNULL(REGEXP_EXTRACT([Notes],'(balance|complain|statement|Balance|Complain|Statement)’),’Other’)

This will find the first instance where any of the above was found and return it. If none of these were found, then it will return ‘Other’ (from the IFNULL function). Although for the code you must specify the Case of the letters, Tableau groups the results together as the same result. So all of my ‘statement’ results were given the same output as my ‘Statement’ results (in this case they were both put as statements). This was useful as I didn’t care if the word had a capital letter or not.

Some limitations/problems/potential adjustments with this code

For the challenge given, the code works perfectly. However, there are somethings you would want to consider as you get more data points:

  • If a note was regarding two or more different topics (like a compliant and a balance check) the code will only pick up the first time one of these were mentioned. This would be something you would have to consider going forward as you’d want to pick all of these up.
  • Mis-spelt words would not get picked up, you would have to update the code to pick these up.
  • If they stop using # before the customer number, you would need to update this logic.

But these are all things that you could look to fix using more complex regex code.