Specificity vs Sensitivity

Before you start throwing rotten tomatoes at me for even discussing a topic as unlikable as regex, let me justify myself. I have a condition, it is chronic, and I am afraid there is no cure: I am a gigantic nerd. So please forgive me for discussing this topic, as I just cannot contain my nerdiness. If you want to save yourself from learning more about regex, or even from exposing your precious un-nerdy, cool eyes to such hideousness, I will understand. There is a 'Too Long; Didn't Read' version at the very end that you can skip to.

Regex, short for regular expression and sometimes called a rational expression, is a sequence of characters that specifies a pattern to search for in a string. The history of regular expressions and the level at which they exist in computers is fundamental but also above my current pay grade. However, our friends at Computerphile have a great YouTube video with a description from Professor Brailsford on this very topic. From my understanding (and bear in mind, my knowledge of this topic manifested only a week ago), writing a regex feeds a set of rules to an automaton, also known as an engine or a finite state machine, which executes the search for the characters that match the expression. I may, or may not, be building my own regex engine, after which I should be a little more qualified to write on regex and how it works.

When writing regular expressions, there is a tension between making the expression sensitive enough to capture every string that should match and specific enough to capture nothing more than those matches. This is the regular expression's balance between sensitivity and specificity.

For the sake of this blog, I will be writing on something rather more simple - although, to call anything 'more simple' when discussing regex is like calling Ronnie Coleman's left toe 'less muscular'. Examples are possibly the best way to explain this dichotomy.

Example 1: Alteryx Weekly Challenge 56

A fictional company would like to capture how many times each hashtag was used in social media posts. The workflow must determine which hashtags are referenced, trace them back to individual users, and list the details grouped by hashtag. The data is in a jumbled state: the 'text' field containing the hashtags has special characters, numbers, punctuation, spaces, and wild formatting.

Spoilers

It's clear the hashtag starts with, well, a #. So it would be prudent to put a # at the start of the regex. A hashtag can contain letters and numbers, upper case and lower case, so a set (i.e. [A-Za-z0-9]) could follow the #, but it is very easily replaced by \w, which covers the same characters plus the underscore.
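To see how a hand-written set and \w compare, here is a minimal sketch in Python rather than Alteryx. The sample characters are made up for illustration, and the re.ASCII flag is an assumption added so that Python's \w is limited to letters, digits and the underscore. Note that a set whose digit range starts at 1 would miss 0, so the set below uses 0-9.

```python
import re

# Hypothetical sample characters, not taken from the challenge data.
samples = ["a", "Z", "7", "0", "_", "#", " "]

explicit_set = re.compile(r"[A-Za-z0-9]")
# re.ASCII restricts \w to [A-Za-z0-9_]; without it Python also accepts
# non-English letters and digits.
word_class = re.compile(r"\w", re.ASCII)

set_hits = [c for c in samples if explicit_set.fullmatch(c)]
word_hits = [c for c in samples if word_class.fullmatch(c)]

print(set_hits)   # ['a', 'Z', '7', '0']
print(word_hits)  # ['a', 'Z', '7', '0', '_']
```

The only difference on these samples is the underscore, which \w accepts and the set does not.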

So far we have (#\w).

However, a quantifier is required to capture the full hashtag and not just the first character after #. The end of the hashtag is denoted by a space, or by any other character that \w does not cover. If a * is used to denote zero or more, the regex will also match a lone # with nothing after it. A + instead requires at least one character after the # and picks up all characters until the hashtag ends. Regex within Alteryx is greedy by default, meaning that if the quantifier (i.e. + or *) is not followed by a ?, it will pick up as many characters that fit the rule \w as it can, stopping at the first character that doesn't.
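The effect of each quantifier is easier to see side by side. This is a small sketch using Python's re module as a stand-in for Alteryx, with a made-up sentence as input.

```python
import re

# Made-up text: one real hashtag and one stray '#' followed by a space.
text = "Loving the #sunset2024 tonight! # not a tag"

one_char = re.findall(r"#\w", text)       # no quantifier: one character only
zero_or_more = re.findall(r"#\w*", text)  # '*' also accepts a bare '#'
one_or_more = re.findall(r"#\w+", text)   # '+' needs at least one character
lazy = re.findall(r"#\w+?", text)         # '?' makes '+' stop as early as it can

print(one_char)      # ['#s']
print(zero_or_more)  # ['#sunset2024', '#']
print(one_or_more)   # ['#sunset2024']
print(lazy)          # ['#s']
```

The * version is the more sensitive one and picks up the stray #, while the + version is specific enough to ignore it.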

With (#\w+), a tokenise regex tool will pick up all hashtags.
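As a rough Python equivalent of tokenising and then counting, the sketch below runs the same pattern over a few invented posts. The user names and post text are hypothetical and are not the challenge data, and re.findall plus Counter are only stand-ins for the Alteryx tools.

```python
import re
from collections import Counter

# Invented posts standing in for the challenge's 'text' field.
posts = [
    ("user_a", "Great day!! #Alteryx #data-viz 123"),
    ("user_b", "learning (#regex) & #Alteryx..."),
    ("user_c", "no tags here, just a # sign"),
]

hashtag = re.compile(r"#\w+")

# One (user, hashtag) row per match, similar in spirit to tokenising to rows.
rows = [(user, tag) for user, text in posts for tag in hashtag.findall(text)]

# Count how often each hashtag appears across all posts.
counts = Counter(tag for _, tag in rows)

print(rows)
print(counts)
```

Notice that #data-viz comes back as #data, because the hyphen is not covered by \w. Whether that is the right result depends on how sensitive the expression needs to be.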

Example 2: Emails

There are many use cases for a regex that finds or matches emails. For example, a company may want to ensure new sign-ups have used correctly formatted emails. What is common to all emails is an @, and that's also the place to start.

Example email: bob.thefrog2@gmail.co.uk

So far we have @ which returns only the @.

Applying the same principles as above, what surrounds the @ are characters covered by \w, and because there are one or more of them, each \w should be followed by a +.

So far we have \w+@\w+ which returns thefrog2@gmail.

If we also want to include the full stops, we can put \. in a set alongside \w. The backslash escapes the dot so that it means a literal full stop, whereas an unescaped . means any character. We want one or more of any of the characters in the set, so the + needs to be added after the set.

So far we have [\.\w]+@[\.\w]+ which returns bob.thefrog2@gmail.co.uk.
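The three steps can be replayed in Python as a quick check. This is only a sketch using re.search on the example email; the string "..@.." is an invented input added to show where the final pattern is still too sensitive.

```python
import re

email = "bob.thefrog2@gmail.co.uk"

# The three patterns built up in this example, in order.
steps = [r"@", r"\w+@\w+", r"[\.\w]+@[\.\w]+"]

# First match of each pattern in the example email.
results = [re.search(pattern, email).group() for pattern in steps]

for pattern, result in zip(steps, results):
    print(pattern, "->", result)

# Invented input: dots only on both sides still counts as a match.
accepts_nonsense = re.fullmatch(steps[-1], "..@..") is not None
print(accepts_nonsense)  # True
```

The last pattern captures the whole example email, but it also accepts a string made of nothing but dots around an @, which is the kind of gap a more specific expression would close.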

This email regex can get more specific if necessary but for an example, let's leave it there.

TLDR: regex can be specific or sensitive and there is a balance between these attributes when writing an expression.

Author:

Ozlem Sigbeku
