We had been told in advance that today we would start on a fairly complicated topic: regular expressions (regex), sequences of characters that specify a search pattern in text. Regex pattern rules will matter a great deal for web scraping later on. To be honest, I was quite shocked at first when I looked at a long, chaotic sequence of special characters, which are sometimes described as wildcards on steroids. You are probably familiar with wildcard notation such as *.csv. One of the most common uses of regex is extracting pieces of text, such as words or email addresses. Given how widely regex is used, this blog only scratches the tip of the iceberg 😊: the lazy and greedy concepts in the context of regular expressions.
In the basic lesson you will probably come across the following two patterns:
.* is greedy, meaning that it matches as many characters as possible. It runs past the next part of your regex and only gives characters back (backtracks) when the rest of the pattern would otherwise fail to match.
.*? is lazy (ungreedy), meaning that it matches as few characters as possible. It hands over to the next part of your regex as soon as that part can match, even if .* itself could still keep matching.
Let’s begin with an example:
/(.*) dog/ applied to I think your dog bit my dog will match the whole sentence, and group 1 will be I think your dog bit my.
/(.*?) dog/ applied to the same sentence will match only I think your dog, and group 1 will be I think your.
The ? makes the preceding quantifier (* or +) "lazy" instead of "greedy". This means it tries to match as few times as possible, instead of trying to match as many times as possible.
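To see the difference for yourself, here is a minimal sketch of the dog example using Python's built-in re module. Python is just my choice for illustration; the variable names are made up, and the same patterns should behave the same way in most regex flavours.

```python
import re

text = "I think your dog bit my dog"

# Greedy: .* grabs as much as it can, then backtracks to the LAST " dog"
greedy = re.search(r"(.*) dog", text)

# Lazy: .*? grabs as little as it can and stops at the FIRST " dog"
lazy = re.search(r"(.*?) dog", text)

print(greedy.group(0))  # I think your dog bit my dog
print(greedy.group(1))  # I think your dog bit my
print(lazy.group(0))    # I think your dog
print(lazy.group(1))    # I think your
```

Both patterns start matching at the same position; the only thing that changes is how much text group 1 is allowed to swallow.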
To make things clear, let’s trace the search step by step with another example. The regex /".+?"/g works as intended: it finds "your" and "my":
1. The first step is to find the start of the pattern, the opening quote '"'.
2. The next step is similar: the engine finds a match for the dot '.'.
3. And now the search goes differently. Because +? is in lazy mode, the engine doesn’t try to match the dot one more time, but stops and tries to match the rest of the pattern, '"', right away. The next character is not a quote, so this attempt fails.
4. Then the regular expression engine increases the number of repetitions for the dot and tries one more time. Failure again. Then the number of repetitions is increased again and again…
5. …until a match for the rest of the pattern is found.
6. The next search starts from the end of the current match and yields one more result.
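The six steps above can be imitated by hand in a few lines of Python. This is only a rough sketch of the idea, not how a real regex engine is implemented. The function name lazy_quoted and the sample sentence are my own assumptions (the sentence is a guess that contains "your" and "my" in quotes), and the sketch assumes the text has no line breaks, since the dot normally does not match a newline.

```python
def lazy_quoted(text, start=0):
    """Imitate /".+?"/ by hand. Returns (match, end_position) or None."""
    open_pos = text.find('"', start)          # step 1: find the opening quote
    while open_pos != -1:
        length = 1                            # step 2: the dot takes one character
        while open_pos + 1 + length < len(text):
            after_dots = open_pos + 1 + length
            if text[after_dots] == '"':       # steps 3 and 5: try the closing quote
                return text[open_pos:after_dots + 1], after_dots + 1
            length += 1                       # step 4: failed, take one more character
        # no closing quote found for this opening quote, try the next one
        open_pos = text.find('"', open_pos + 1)
    return None


# Hypothetical sample sentence, assumed for illustration
sample = 'I think "your" dog bit "my" dog'

matches = []
pos = 0
found = lazy_quoted(sample, pos)
while found is not None:
    match, pos = found                        # step 6: continue after the match
    matches.append(match)
    found = lazy_quoted(sample, pos)

print(matches)  # ['"your"', '"my"']
```

The inner loop is the lazy part: it checks for the closing quote after every single extra character instead of running to the end of the text first.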
If the expression were /".+"/ instead, which is greedy, the match would be "your" dog bit "my".
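Here is a small sketch that puts the lazy and greedy versions side by side with Python's re.findall, which plays the role of the g flag here. The sample sentence is an assumption on my part, chosen so that it contains "your" and "my" in quotes.

```python
import re

# Hypothetical sample sentence, assumed for illustration
sample = 'I think "your" dog bit "my" dog'

lazy_matches = re.findall(r'".+?"', sample)   # stops at the nearest closing quote
greedy_matches = re.findall(r'".+"', sample)  # runs on to the last closing quote

print(lazy_matches)    # ['"your"', '"my"']
print(greedy_matches)  # ['"your" dog bit "my"']
```

The lazy pattern gives two short matches, while the greedy one gives a single long match that stretches from the first quote to the last.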
That’s a decent bit to learn on Friday, isn’t it?