The basics of regex

I have an upcoming "Learn What the Dataschool Learns" session for which I need to design a lesson on web scraping in Alteryx. To do that, I first had to learn some basics of regex, so I thought it would be worthwhile to write a blog on what I've learned.

Regular expressions, or regex for short, are a powerful tool used to search for and manipulate patterns in text. They are commonly used to extract specific pieces of information from a long string, check if a piece of data is in a valid format (such as checking if a string is in the proper format for an email address), replace parts of a string with a different string (such as correcting misspellings), and break up a string based on a delimiter.
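To make those four uses concrete, here is a minimal sketch using Python's built-in re module. Python is my choice for illustration only (the same patterns can be used in Alteryx), and the sample strings and the deliberately simple email pattern are made up for this example rather than production-ready.

```python
import re

text = "Order 1234 shipped on Monday"

# Extract: pull the first run of digits out of a longer string.
order_id = re.search(r"\d+", text).group()

# Validate: a deliberately simple email check (real validation is stricter).
is_email = re.fullmatch(r"[\w.]+@[\w.]+\.\w+", "jane@example.com") is not None

# Replace: correct a misspelling.
fixed = re.sub(r"recieve", "receive", "Please recieve the parcel")

# Split: break a string on a delimiter (a comma plus any following spaces).
parts = re.split(r",\s*", "a, b,c")

print(order_id, is_email, fixed, parts)
```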

A regular expression is a sequence of characters that defines a search pattern. It provides a flexible way to match strings of text and is used whenever you need to search for a specific pattern within a string. When using regex for data preparation, you always provide the string to search and the pattern to search for.

There are two types of characters in a regex pattern: standard characters and special characters. Standard characters are those that match exactly what they are, such as letters and numbers. Special characters, on the other hand, alter the logic of the search pattern and do not match themselves.

One example of a special character is the dot (.), which matches any single character. For example, if the search pattern is "h.llo", it would match "hello", "h4llo", and "h.llo", but not "hi", "heello", or "Hello". The backslash (\) is another special character. It is used to escape other special characters so that they match themselves, and to make some standard characters special. For example, if the search pattern is "h\.ello", it would match "h.ello", but not "hi", "hello", "h.llo", or "H\ello".
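A quick way to check the difference between a bare dot and an escaped dot is to run both patterns over a list of candidates. This is a small sketch in Python; the helper name matches is my own, and I use re.fullmatch so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# A bare dot stands for any single character (except a line break by default).
dot_any = matches(r"h.llo", ["hello", "h4llo", "h.llo", "hi", "heello", "Hello"])

# An escaped dot only matches a literal full stop.
dot_literal = matches(r"h\.ello", ["h.ello", "hi", "hello", "h.llo", "H\\ello"])

print(dot_any)      # ['hello', 'h4llo', 'h.llo']
print(dot_literal)  # ['h.ello']
```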

There are also shorthand characters, which are used to represent common groups of characters. These include \d for any digit, \w for any alphanumeric character or underscore, and \s for any whitespace character (including tabs and line breaks). Each has a negated, upper-case counterpart: \D for any non-digit, \W for any character that is not alphanumeric or an underscore, and \S for any non-whitespace character.
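The shorthand characters are easiest to understand by seeing what each one picks out of the same string. Here is a small Python sketch; the sample string is made up for the example.

```python
import re

sample = "Room_12 is\tfree"  # contains an underscore, a space and a tab

digits = re.findall(r"\d", sample)       # each digit on its own
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)       # the space and the tab
non_digits = re.findall(r"\D+", sample)  # runs of anything that is not a digit
non_words = re.findall(r"\W", sample)    # here, the same as the whitespace
non_spaces = re.findall(r"\S+", sample)  # runs of non-whitespace

print(digits, words, spaces, non_digits, non_words, non_spaces)
```

Note that the underscore in "Room_12" counts as a \w character, so it stays inside the first word.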

Square brackets can be used to define a set of values that could be in a certain position in the search pattern. This set is known as a character set and can include ranges of characters. For example, if the search pattern is "H[ea]llo", it would only match "Hello" and "Hallo". The hyphen (-) can also be used to specify a range of characters, such as "H[a-z]llo", which would match "Hallo", "Hbllo", "Hcllo", and so on, but not "HEllo" or "hallo". It's important to note that regex is case sensitive unless a case-insensitive option is switched on, so "H" and "h" are treated as different characters.
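Character sets and case sensitivity can be tested the same way. In this Python sketch the helper matches is my own, and re.IGNORECASE is Python's flag for case-insensitive matching; other tools expose the same idea under a different setting.

```python
import re

def matches(pattern, candidates, flags=0):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c, flags)]

# A set: the second character must be "e" or "a".
set_hits = matches(r"H[ea]llo", ["Hello", "Hallo", "hallo", "Hillo"])

# A range: the second character must be a lower-case letter.
range_hits = matches(r"H[a-z]llo", ["Hallo", "Hbllo", "Hcllo", "HEllo", "hallo"])

# With case-insensitive matching switched on, "hallo" matches as well.
loose_hits = matches(r"H[ea]llo", ["Hello", "hallo"], re.IGNORECASE)

print(set_hits)    # ['Hello', 'Hallo']
print(range_hits)  # ['Hallo', 'Hbllo', 'Hcllo']
print(loose_hits)  # ['Hello', 'hallo']
```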

Sometimes the number of times a character occurs in the search pattern is important. To account for this, we can use quantifiers after the character in question. These include the star (*), which matches zero or more of the preceding character, the plus (+), which matches one or more of the preceding character, and the question mark (?), which matches zero or one of the preceding character. We can also use curly braces ({}) to give an exact quantity or a range for the preceding character. For example, {1,3} would match between 1 and 3 occurrences, {3} would match exactly 3, and {3,} would match 3 or more.
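Here is a small Python sketch that tries each quantifier against a few strings. The helper name full and the test strings are my own; each list holds True or False for the strings in the order shown.

```python
import re

def full(pattern, text):
    """True when the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

hellos = ["hllo", "hello", "heeello"]
star = [full(r"he*llo", s) for s in hellos]      # zero or more "e"
plus = [full(r"he+llo", s) for s in hellos]      # one or more "e"
optional = [full(r"he?llo", s) for s in hellos]  # zero or one "e"

exact = [full(r"\d{3}", s) for s in ["12", "123", "1234"]]       # exactly 3 digits
between = [full(r"\d{1,3}", s) for s in ["1", "123", "1234"]]    # 1 to 3 digits
at_least = [full(r"\d{3,}", s) for s in ["12", "123", "12345"]]  # 3 or more digits

print(star, plus, optional)
print(exact, between, at_least)
```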

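Regular brackets can also be used to group multiple characters together, which can then be modified with a quantifier. For example, if we want to match "hello " three times in a row but not "hello" on its own, our search pattern could be "(hello\s){3}". Note that this pattern expects a whitespace character after every "hello", including the last one. To match "hello hello hello" with nothing after the final word, a pattern such as "hello(\shello){2}" works instead.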
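Groups with quantifiers are worth testing carefully, because the quantifier repeats everything inside the brackets, including the \s. In this Python sketch the second pattern is my own variation, added to show one way of handling a string with no whitespace at the end.

```python
import re

grouped = r"(hello\s){3}"  # "hello" plus one whitespace character, three times

with_trailing = re.fullmatch(grouped, "hello hello hello ") is not None
without_trailing = re.fullmatch(grouped, "hello hello hello") is not None
single = re.fullmatch(grouped, "hello") is not None

# Variation: one "hello", then whitespace plus "hello" repeated twice.
no_trailing = r"hello(\shello){2}"
variation = re.fullmatch(no_trailing, "hello hello hello") is not None

print(with_trailing, without_trailing, single, variation)
# True False False True
```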

The best place to practice regex is regexr.com; I've found it a very helpful resource.

Many thanks to Etienne Soubes-Goldman for this tremendously helpful video!

Author:

Lucas Krokatsis
