Regular Expressions (RegEx) Made Easy

Dec 22, 2022 | Shashank Kuchibhotla

Regular Expressions have always given me nightmares as I tend to trip over the usage.

A Regular Expression is a sequence of characters that works as a search pattern for finding matching text in any given input. Because Regular Expressions are supported by most general-purpose programming languages, one of their most common uses is validating registration form data such as usernames and passwords. Here is an example of a Regular Expression that validates usernames of user accounts:

[a-zA-Z0-9]{5,15}

When the whole username has to match, this Regular Expression would accept the following usernames:

adamSandler, NicholasCage123, BRUCEWILLIS

This Regular Expression would reject the following usernames:

12Hi, SylvesterStallone, ca$h
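
As a quick check, here is a minimal Python sketch of that validation. It assumes the pattern is meant to cover the whole username (letters and digits only, 5 to 15 characters); the helper name is_valid_username is made up for this illustration.

```python
import re

# Assumption: letters and digits only, 5 to 15 characters, written with
# ASCII hyphens in the ranges and a comma in the quantifier.
USERNAME_PATTERN = re.compile(r"[a-zA-Z0-9]{5,15}")


def is_valid_username(name: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    # fullmatch rejects partial matches, such as the first 15 letters
    # of a longer name.
    return USERNAME_PATTERN.fullmatch(name) is not None


for name in ["adamSandler", "NicholasCage123", "BRUCEWILLIS",
             "12Hi", "SylvesterStallone", "ca$h"]:
    print(f"{name}: {is_valid_username(name)}")
```

The first three names print True. The last three print False: 12Hi is too short, SylvesterStallone is too long, and ca$h contains a character outside the set.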

There are two major types of characters that we can use to form a Regular Expression: Basic Characters and Meta Characters.

 

1. Basic Characters

We know that a Regular Expression is just a pattern of characters that we can use to perform text search. Thus, the Regular Expression cat is just the character c followed by character a and then t. So, the Regular Expression cat will match the word cat in the given text:

The cat can see the mouse

A regular expression is matched character by character against the input string.

We also need to remember that Regular Expressions are case-sensitive. So, the Regular Expression “the” applied to the same input string will match “the” but not “The”. Only the lowercase “the” before “mouse” is matched:

The cat can see the mouse
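
The same behaviour can be reproduced with Python's re module. This is only a sketch; the variable names are illustrative, and the re.IGNORECASE flag is shown as an aside for when case should not matter.

```python
import re

text = "The cat can see the mouse"

# Basic characters match themselves, in order.
cat_matches = re.findall(r"cat", text)

# Matching is case-sensitive by default, so "The" is skipped.
the_matches = re.findall(r"the", text)

# Aside: the IGNORECASE flag relaxes that rule.
the_any_case = re.findall(r"the", text, flags=re.IGNORECASE)

print(cat_matches)   # ['cat']
print(the_matches)   # ['the']
print(the_any_case)  # ['The', 'the']
```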

This was too basic, wasn’t it? Time to turn it up a notch! Let’s now talk about Meta characters…

 

2. Meta Characters

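Meta Characters are the basic building blocks of a Regular Expression. Each Meta Character has a special meaning when used as part of a Regular Expression and is interpreted in its own way. Here is a list of the Meta Characters covered in this article:

.       matches any single character except a line break
[ ]     character set: matches any one of the characters inside the brackets
[^ ]    negated character set: matches any character not inside the brackets
*       zero or more repetitions of the preceding element
+       one or more repetitions of the preceding element
?       makes the preceding element optional
{m,n}   at least m and at most n repetitions of the preceding element
(abc)   capturing group
|       alternation: matches the expression before or after it
\       escapes the next character
^       matches at the start of the input
$       matches at the end of the input

Let’s look into each Meta Character in detail…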

. (Full Stop)

Since the full stop matches any single character except a line break, the regular expression “.r” matches “ar” in the word car in the first example below, and “.at” matches cat, sat, bat in batch, and mat in the second example.

.r  => We bought a car.
.at => The cat sat on the batch mat.
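
A short Python sketch of the two full-stop examples, using re.findall to list every match (variable names are illustrative):

```python
import re

# "." stands for any single character except a line break.
dot_r = re.findall(r".r", "We bought a car.")
dot_at = re.findall(r".at", "The cat sat on the batch mat.")

print(dot_r)   # ['ar']
print(dot_at)  # ['cat', 'sat', 'bat', 'mat']
```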

 

[ ] (Character Sets)

[ ], also known as a Character Set or Character Class, matches any one of the characters specified within the brackets. “[Tt]he” matches either “T” or “t”, followed by he. Thus, it matches both “The” and “the” in the first example below. In the second example, we give a range of characters from “c” to “s” followed by at. So, it matches cat, sat, and mat. But note how this does not match bat in batch, since “b” is not in the range from “c” to “s”. In the third example, the range runs from “b” to “s”, which includes “b”, so we also match bat in batch.

[Tt]he  => The cat can see the mouse.
[c-s]at => The cat sat on the batch mat.
[b-s]at => The cat sat on the batch mat.
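
To see the effect of widening the range, here is a small Python sketch that runs the three Character Set examples (variable names are illustrative):

```python
import re

sentence = "The cat sat on the batch mat."

# Either "T" or "t", followed by "he".
the_words = re.findall(r"[Tt]he", "The cat can see the mouse.")

# "b" lies outside c-s, so "bat" in "batch" is skipped.
c_to_s = re.findall(r"[c-s]at", sentence)

# Widening the range to b-s picks up "bat" as well.
b_to_s = re.findall(r"[b-s]at", sentence)

print(the_words)  # ['The', 'the']
print(c_to_s)     # ['cat', 'sat', 'mat']
print(b_to_s)     # ['cat', 'sat', 'bat', 'mat']
```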

 

[^] (Except this character)

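Since [^ ] matches any single character except the ones listed in the brackets after the ^ symbol, the first example below matches “par” of parked and “gar” of garage, but not car. The second and third examples show how to exclude several characters at once, either by listing them or by giving a range. Both match only “par”, since “g” is excluded explicitly in the second example and falls within the range “c” to “h” in the third.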

[^c]ar   => The car parked in the garage.
[^cg]ar  => The car parked in the garage.
[^c-h]ar => The car parked in the garage.
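
The three negated sets can be tried out with a few lines of Python (a sketch; variable names are illustrative):

```python
import re

sentence = "The car parked in the garage."

# Any character except "c", followed by "ar".
not_c = re.findall(r"[^c]ar", sentence)

# Excluding "g" as well drops "gar".
not_c_or_g = re.findall(r"[^cg]ar", sentence)

# A range works too: "g" falls inside c-h, "p" does not.
not_c_to_h = re.findall(r"[^c-h]ar", sentence)

print(not_c)       # ['par', 'gar']
print(not_c_or_g)  # ['par']
print(not_c_to_h)  # ['par']
```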

 

*

The * symbol matches zero or more occurrences of the element preceding it, which may be a single character, a character set in brackets [ ], or a group. In the first example below, [a-z]* will match every lowercase word in the sentence, but only he in the word “The”, since “T” is not in the range “a” to “z”. In the second example, \s represents a whitespace character. Thus, the regular expression means zero or more whitespace characters, followed by the characters “c”, “a”, and “t”, followed by zero or more whitespace characters. It will match cat together with the spaces around it, and cat in concatenation.

[a-z]*    => The cat sat on the batch mat.
\s*cat\s* => The cat and bat sat on the concatenation.
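
Here is a hedged Python sketch of both * examples. One thing to be aware of: because * also accepts zero occurrences, re.findall reports empty matches wherever nothing fits, so the sketch filters those out for the first pattern.

```python
import re

# [a-z]* can match the empty string, so drop the empty results.
lowercase_runs = [m for m in re.findall(r"[a-z]*", "The cat sat on the batch mat.") if m]

# Optional whitespace on either side of "cat".
cat_matches = re.findall(r"\s*cat\s*", "The cat and bat sat on the concatenation.")

print(lowercase_runs)  # ['he', 'cat', 'sat', 'on', 'the', 'batch', 'mat']
print(cat_matches)     # [' cat ', 'cat']
```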

 

+

Similar to *, but with a slight twist, the + symbol matches one or more occurrences of the element preceding it. Like *, it matches in a greedy manner, which means that it consumes as many characters as possible while still allowing the overall pattern to match. In both examples below, since “.” matches any character (including white space), the regular expression matches the character “c”, followed by at least one character, and then “t”. In the first example, the match therefore does not stop at cat: it starts at the first c in the string and runs all the way to the last t, giving “cat sat on the batch mat”. In the second example, the match likewise starts at cat, since it contains the first c in the string, and runs all the way to …nat in concatenation, since that t is the last t in the string.

c.+t => The cat sat on the batch mat.
c.+t => The cat and bat sat on the concatenation.
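
Greedy matching is easiest to see by printing the span each pattern actually returns. A minimal Python sketch (variable names are illustrative):

```python
import re

first = re.search(r"c.+t", "The cat sat on the batch mat.")
second = re.search(r"c.+t", "The cat and bat sat on the concatenation.")

# ".+" grabs as much as it can, so each match runs from the first "c"
# to the last "t" in the string.
print(first.group(0))   # cat sat on the batch mat
print(second.group(0))  # cat and bat sat on the concatenat
```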

 

?

The ? character makes the preceding element optional. In other words, it matches zero or one occurrence of the element preceding the ? symbol. In the first example, [T]he matches only if “T” is present, with “h” and “e” following it. Thus, only “The” is matched in the given string. However, in the second example, [T]?he matches “The” and also “he” of “the”.

[T]he  => The cat sat on the mat.
[T]?he => The cat sat on the mat.
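
In Python, the difference between the two patterns looks like this (a sketch; variable names are illustrative):

```python
import re

sentence = "The cat sat on the mat."

# "T" is required here.
required_t = re.findall(r"[T]he", sentence)

# "T" is optional here, so "he" inside "the" also matches.
optional_t = re.findall(r"[T]?he", sentence)

print(required_t)  # ['The']
print(optional_t)  # ['The', 'he']
```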

 

{m,n}

{ } braces are quantifiers, used to specify the number of times that a character or a set of characters can be repeated. By specifying “m” and “n”, we provide the lower limit and the upper limit of the number of repetitions.

8{0,9}     => 0.88999
[0-6]{2,3} => 0.88999
[0-9]{3}   => 9.9997
[0-9]{2,}  => 9.999999997

In the first example, the regular expression 8{0,9} matches 88 in 0.88999, since it looks for the digit 8 repeated a minimum of 0 times and not more than 9 times. (Because the minimum is 0, the pattern is also satisfied by an empty match at the other positions.) In the second example, the regular expression is not able to match anything in the given string, since it looks for digits in the range 0 to 6 repeated a minimum of 2 times and not more than 3 times. In the third example, we are looking for digits in the range 0 to 9 repeated exactly 3 times. That is why we can identify 999 in 9.9997. In the fourth example, by not providing the upper limit we imply that there is no upper limit on the repetition. In other words, we are looking for digits in the range 0 to 9 repeated 2 or more times.
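
The four quantifier examples can be checked with the following Python sketch. It assumes the ranges are written with ASCII hyphens ([0-6], [0-9]), and it filters out the empty matches that a lower limit of 0 produces.

```python
import re

# {0,9} allows zero repetitions, so drop the empty matches.
eights = [m for m in re.findall(r"8{0,9}", "0.88999") if m]

# No run of two or three digits from 0-6 exists in this string.
low_digits = re.findall(r"[0-6]{2,3}", "0.88999")

# Exactly three digits.
three_digits = re.findall(r"[0-9]{3}", "9.9997")

# Two or more digits, with no upper limit.
two_or_more = re.findall(r"[0-9]{2,}", "9.999999997")

print(eights)        # ['88']
print(low_digits)    # []
print(three_digits)  # ['999']
print(two_or_more)   # ['999999997']
```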

 

(abc)

Content placed inside ( ) parentheses is called a Capturing Group. A Capturing Group lets us treat several characters as a single unit, so that a Meta Character placed after the group, such as a quantifier, applies to the whole group rather than to a single character.

(ab)* => Addis Ababa

In the given example, the Regular Expression looks for zero or more repetitions of “ab”. Thus, we find the “ab” inside “Ababa” in the given input string.
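
A small Python sketch of the group example. The extra strings "ababab" and "abbb" are not from the example above; they are added here only to contrast (ab)* with ab*.

```python
import re

text = "Addis Ababa"

# group(0) is the whole match; empty matches are filtered out because
# (ab)* is also satisfied by zero repetitions at every other position.
found = [m.group(0) for m in re.finditer(r"(ab)*", text) if m.group(0)]
print(found)  # ['ab']

# With the group, * repeats "ab" as a unit; without it, * repeats only "b".
group_repeats = bool(re.fullmatch(r"(ab)*", "ababab"))
no_group_repeats = bool(re.fullmatch(r"ab*", "ababab"))
no_group_bs = bool(re.fullmatch(r"ab*", "abbb"))

print(group_repeats)     # True
print(no_group_repeats)  # False
print(no_group_bs)       # True
```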

 

|

The | meta character is known as Alternation and is mostly used within a Capturing Group. This can be used to alternate between multiple expressions within a Capturing Group.

(few|men) => A few good men
([Tt]he|sat) => The cat sat on the mat

In the first example, we are simply looking for either few or men in the given string. Thus, we find “few” and “men”. After looking at this example, you might want to know the difference between Character Sets and Alternation: Character Sets choose between single characters, whereas Alternation chooses between whole expressions. The second example demonstrates this difference. With “[Tt]he”, we are looking for “T” or “t” followed by he, that is, both “The” and “the”. With the help of the Alternation character, we are then looking for either “[Tt]he” or sat. Thus, the words “The”, sat, and “the” are found in the given string.
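
Both Alternation examples in Python, as a sketch. re.finditer with group(0) is used so that the whole match is printed regardless of how the groups are arranged.

```python
import re

few_or_men = [m.group(0) for m in re.finditer(r"(few|men)", "A few good men")]

# A Character Set ([Tt]) picks one character; | picks one whole expression.
the_or_sat = [m.group(0) for m in re.finditer(r"([Tt]he|sat)", "The cat sat on the mat")]

print(few_or_men)  # ['few', 'men']
print(the_or_sat)  # ['The', 'sat', 'the']
```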

 

\

The backslash \ character is used to escape the next character. This enables us to look for characters such as ., {, }, [, ], +, *, and other Regular Expression reserved characters as literal text in a given string.

[csm]at\.? => The cat sat on the mat.

With “[csm]at” in the above example, we are looking for cat, sat, or mat, and with “\.?” we are looking for an optional character “.” at the end of either cat, sat, or mat. Thus, we are able to find cat, sat, and mat. in the given string.

We can also form a similar Regular Expression using a Capturing Group in the following way:

(c|s|m)at\.?

This will also result in the same outcome for the given input string “The cat sat on the mat.”
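
A quick Python sketch confirming that the two spellings find the same text (variable names are illustrative):

```python
import re

sentence = "The cat sat on the mat."

# "\." is a literal full stop; "?" makes it optional.
with_set = re.findall(r"[csm]at\.?", sentence)

# Same idea with a Capturing Group; group(0) returns the whole match.
with_group = [m.group(0) for m in re.finditer(r"(c|s|m)at\.?", sentence)]

print(with_set)    # ['cat', 'sat', 'mat.']
print(with_group)  # ['cat', 'sat', 'mat.']
```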

 

^

The ^ character is known as a Caret. It anchors the match to the start of the input string, so the pattern that follows it must begin at the very first character.

^The => The cat sat on the mat
^the => The cat sat on the mat
^(T|t)he => The cat sat on the mat

The first example checks if the first character of the input string is “T”, since the caret symbol ^ is present before “T”, followed by “h” and then “e”. Since this is true, we are able to find “The”. However, in the second example, we are looking for “t” as the first character followed by “h” and then “e”. Since the first character of the input string is “T” and not “t”, we do not identify anything. In the third example, we use the alternation character to check if the first character in the input string is either “T” or “t”, followed by “h” and “e”. Since this is true, we identify “The” in the input string.
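
The three caret examples can be sketched in Python as follows. The helper find_first is a made-up name for this illustration; it returns the matched text, or None when nothing matches.

```python
import re
from typing import Optional


def find_first(pattern: str, text: str) -> Optional[str]:
    """Return the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


sentence = "The cat sat on the mat"

print(find_first(r"^The", sentence))      # The
print(find_first(r"^the", sentence))      # None
print(find_first(r"^(T|t)he", sentence))  # The
```

Note that the lowercase “the” later in the sentence is ignored by ^the, because the caret only allows a match at the start.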

 

$

The dollar symbol $ works like the Caret symbol ^, but instead checks if the preceding character is the last character in the input string.

t$ => The cat sat on the mat
(at)$ => The cat sat on the mat
(at\.)$ => The cat sat on the mat.

In the first example, we are checking if the last character of the input string is “t”, and since this is true, we find “t”. In the second example, we use a Capturing Group to check if the input string ends with the characters “at”. Since this is true, we find “at”. In the third example, we search for “at” followed by a literal full stop, which is written as “\.” because the backslash escapes the “.” Meta Character. Since the input string ends with “at.”, we find “at.”.
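
Here is the matching Python sketch for the dollar examples. The helper find_first is a made-up name for this illustration, and the last call is an extra check showing that the escaped full stop is required, not optional.

```python
import re
from typing import Optional


def find_first(pattern: str, text: str) -> Optional[str]:
    """Return the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


print(find_first(r"t$", "The cat sat on the mat"))        # t
print(find_first(r"(at)$", "The cat sat on the mat"))     # at
print(find_first(r"(at\.)$", "The cat sat on the mat."))  # at.

# Without the trailing full stop in the input, (at\.)$ finds nothing.
print(find_first(r"(at\.)$", "The cat sat on the mat"))   # None
```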

 

3. Shorthand Characters

We can use the following shorthand characters to look for groups/types of characters in the input string. Here are some of such characters commonly used with Regular Expressions:

[Figure: table of shorthand character sets]
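
As an illustration, here is a Python sketch of a few shorthands that are widely supported, including by Python's re module: \d (digit), \w (word character), \s (whitespace), and their negations \D and \S. This selection and the sample text are assumptions made for the example and may not match the original table exactly.

```python
import re

text = "Room 42, floor 3"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)       # single whitespace characters
non_digits = re.findall(r"\D+", text)  # runs of anything but digits
non_spaces = re.findall(r"\S+", text)  # runs of anything but whitespace

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'floor', '3']
print(spaces)      # [' ', ' ', ' ']
print(non_digits)  # ['Room ', ', floor ']
print(non_spaces)  # ['Room', '42,', 'floor', '3']
```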

Conclusion

Though I have not covered every RegEx concept, this article covers most of the basics, which should give you a suitable foundation to build on and to tackle the majority of everyday Regular Expression problems. Bon voyage en route to RegEx supremacy!

 

References

[1] https://github.com/ziishaned/learn-regex/blob/master/README.md#learn-regex