RegEx: The Basics

A regular expression (regex) is used to find a pattern in strings. Regexes are supported by most programming languages and can also be used to search in many text editors. Here, irrespective of the programming language, we will learn the basics of building a regex to get you started quickly.

Let’s start with the different ways of matching a string.

Word

We will use the string below as the example text to search in:

Regular Expression (regex) is used to find a pattern in strings

To find or match a simple string, you can use the word as is.

To find the word pattern, we can use the regex below:

pattern
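As a minimal sketch, here is how that search could look in Python. The choice of Python and its built-in re module is only an assumption for illustration; the regex itself stays the same in other languages and editors.

```python
import re

text = "Regular Expression (regex) is used to find a pattern in strings"

# re.search scans the whole string and returns the first match, or None.
match = re.search(r"pattern", text)

found = match.group() if match else None      # the text that matched
position = match.start() if match else -1     # index where the match begins
print(found, position)
```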

Starts/ends with

Starts with uses ^ and ends with uses $ to find a match at the edge of a line. That means, to find a line that starts or ends with some pattern, we put ^ before the pattern or $ after it.

To match a line starting with the word Regular, we can use:

^Regular

To find a line ending with strings, we can use:

strings$

Note: When you are not using ^ or $ (or any of the other signs we will learn below), you are telling the regex to find the given word anywhere, even in the middle of the line.
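To see the difference between anchored and unanchored patterns, here is a small sketch using Python's re module (the language choice and variable names are assumptions for illustration). In Python, ^ and $ refer to the start and end of the whole string by default; the re.MULTILINE flag makes them work per line.

```python
import re

text = "Regular Expression (regex) is used to find a pattern in strings"

# Anchored patterns only match at the very start or very end.
starts_with_regular = re.search(r"^Regular", text) is not None
ends_with_strings = re.search(r"strings$", text) is not None

# "regex" occurs in the text, but not at the start, so ^regex fails.
starts_with_regex = re.search(r"^regex", text) is not None

# Without an anchor the word is found anywhere in the text.
contains_regex = re.search(r"regex", text) is not None

print(starts_with_regular, ends_with_strings, starts_with_regex, contains_regex)
```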

Repeating Character ( * and + )

For any character that can repeat zero or more times, * is used. Put * just after the character that can repeat. To find a followed by b repeated zero or more times, use:

ab*

It matches a, ab, abb, abbbbbb etc.

Here, a alone matches because * indicates that b can repeat zero or more times. So even if b has zero occurrences, it still matches.

If we don’t want to allow zero occurrences of b and want at least one, we use +. + stands for one or more occurrences of a character.

ab+

It matches ab, abb, abbb, abbbb etc. But not a.

What if we want zero or one occurrence of b? There is ? for that. It matches zero or one occurrence of the character.

ab?c

It matches ac and abc. But not abbc.

So ? can be used when a character is optional.
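The three signs *, + and ? can be compared side by side. This is a sketch in Python; the helper matches_whole is a hypothetical name introduced here, and it uses re.fullmatch so that the whole candidate has to match rather than just a part of it.

```python
import re

def matches_whole(pattern, candidate):
    """Return True when the pattern matches the entire candidate string."""
    return re.fullmatch(pattern, candidate) is not None

# * allows zero or more b, so a plain "a" is accepted.
star = [s for s in ["a", "ab", "abbbbbb", "b"] if matches_whole(r"ab*", s)]

# + needs at least one b, so "a" is rejected.
plus = [s for s in ["a", "ab", "abbb"] if matches_whole(r"ab+", s)]

# ? allows zero or one b, so "abbc" is rejected.
optional = [s for s in ["ac", "abc", "abbc"] if matches_whole(r"ab?c", s)]

print(star, plus, optional)
```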

You might be asking now: what if I want exactly 3 occurrences of b, or 4, or 7?

Yes, you can do that with braces { }. Put a number inside the braces and it will match exactly that many occurrences of the character. So, continuing with our example, if we want to find 4 occurrences of b after a, we can use the regex below:

ab{4}c

It matches abbbbc only. So {4} after b says that b must occur exactly 4 times.

Another question you may ask: what if I want at least 4 and at most 10 occurrences of b?

You can do that by providing a range inside the braces. For a minimum of 4 and a maximum of 10 occurrences, you can write {4,10}.

ab{4,10}c

It matches abbbbc, abbbbbbbc, abbbbbbbbbbc etc. But not abbc or abbbbbbbbbbbc.
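Counting the b's by eye gets tedious, so here is a sketch in Python that builds the test strings instead. The helper b_string is a hypothetical name used only for this illustration.

```python
import re

exactly_four = re.compile(r"ab{4}c")
four_to_ten = re.compile(r"ab{4,10}c")

def b_string(count):
    """Build 'a', then `count` copies of 'b', then 'c'."""
    return "a" + "b" * count + "c"

# Maps the number of b's to whether the whole string matches.
exact_results = {n: exactly_four.fullmatch(b_string(n)) is not None
                 for n in (3, 4, 5)}
range_results = {n: four_to_ten.fullmatch(b_string(n)) is not None
                 for n in (2, 4, 7, 10, 11)}

print(exact_results)
print(range_results)
```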

Or

If we want to match b or c or d, we can put the | sign between those characters and enclose them in round brackets ( ):

a(b|c|d)c

It matches abc, acc, adc.

What is the role of the brackets here? They ensure that the choice applies to the second position only: we want b or c or d there, and nothing else changes. What if we don’t use brackets? The regex becomes ab|c|dc, and the | now splits the whole expression, so it matches ab or c or dc. That changes the meaning, so it is important to put the brackets here.

If we want the word hi or hello, we can use | without any brackets. Like:

hi|hello

It matches the word hi or hello.
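The effect of the brackets is easy to check. The sketch below, in Python with made-up candidate lists, compares the grouped and ungrouped versions and uses fullmatch so that each candidate must match as a whole.

```python
import re

grouped = re.compile(r"a(b|c|d)c")    # the choice applies to the middle only
ungrouped = re.compile(r"ab|c|dc")    # the choice splits the whole expression

grouped_hits = [s for s in ["abc", "acc", "adc", "aec", "ab"]
                if grouped.fullmatch(s)]
ungrouped_hits = [s for s in ["ab", "c", "dc", "abc"]
                  if ungrouped.fullmatch(s)]

# Whole words need no brackets.
greeting_hits = [s for s in ["hi", "hello", "hey"]
                 if re.fullmatch(r"hi|hello", s)]

print(grouped_hits, ungrouped_hits, greeting_hits)
```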

Single character

To match any single character we can use a dot (.). In most regex engines it matches any one character except a line break. So if we want a at the start, followed by any single character and then c, we can use:

a.c

It matches abc, anc, adc, a1c, aoc etc.
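Here is a quick check of the dot in Python (the candidate list is made up for illustration). Note that in Python the dot does not match a newline unless the re.DOTALL flag is passed.

```python
import re

candidates = ["abc", "a1c", "a c", "ac", "abbc", "a\nc"]

# The dot stands for exactly one character: "ac" has none in the middle,
# "abbc" has two, and "a\nc" has a newline, which the dot skips by default.
dot_hits = [s for s in candidates if re.fullmatch(r"a.c", s)]

print(dot_hits)
```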

Range of characters

Now, if we do not want just any single character but one from a range of characters, we can use [ ]. How can we define a range? For the letters a to z we can use a-z, and for digits 0-9. Here a-z only covers lowercase letters. For capital letters we can use A-Z. We enclose the range in square brackets.

So, for a, followed by any lowercase letter, followed by c, we use:

a[a-z]c

It matches abc, adc, aec etc. But not a1c or a2c.

We can also combine these ranges. So for any letter, lowercase or uppercase, we can use:

a[a-zA-Z]c

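Here we write a-z and A-Z right after each other. There should not be any space between them, because a space inside the square brackets would itself become one of the allowed characters.

Basically, it matches any single character listed between the square brackets. To find only a, e, i, o, u we can use: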

a[aeiou]c

It matches aac, aec, aic etc. But not abc.
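The three bracket expressions from this section can be tried against the same candidates. This is a sketch in Python; the candidate strings are made up for illustration.

```python
import re

candidates = ["abc", "aEc", "a1c", "aec", "a c"]

# Lowercase letters only in the middle.
lowercase_hits = [s for s in candidates if re.fullmatch(r"a[a-z]c", s)]

# Lowercase or uppercase letters in the middle.
letter_hits = [s for s in candidates if re.fullmatch(r"a[a-zA-Z]c", s)]

# Only the listed vowels in the middle.
vowel_hits = [s for s in candidates if re.fullmatch(r"a[aeiou]c", s)]

print(lowercase_hits, letter_hits, vowel_hits)
```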

Not matches

Sometimes it is also important to match something that does not match a pattern or character. For example, if we want to find any three-character word that does not start with a vowel, we can write:

[^aeiou]..

Here, adding ^ at the start inside the square brackets excludes the letters a, e, i, o, u and matches every character except those. The last two dots each match any character (including non-letters).
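A short sketch in Python shows what the exclusion lets through (the word list is made up for illustration). Two details are worth noticing: the dots accept digits too, and only the lowercase vowels are excluded, so a word starting with a capital E still matches.

```python
import re

words = ["cat", "dog", "ant", "owl", "b42", "Egg"]

# First character: anything except a, e, i, o, u. Then any two characters.
hits = [w for w in words if re.fullmatch(r"[^aeiou]..", w)]

print(hits)
```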

Congrats! So, those are the basics of each. Now, how can we combine them? Let’s look at one combination. You can, of course, combine them in any other way, as per your requirement.

A word starting with a, followed by zero or more letters and then a digit:

a[a-zA-Z]*[0-9]

Here, we have used * right after the square brackets, so the * applies to the whole bracket expression, which on its own matches a single letter. Breaking it down: at the first position we have a fixed a. At the second position we can have a letter (a-z, A-Z), zero or more times. After that (at the third position, or later if * matched more than one letter) comes a digit. If there is no letter at the second position (allowed, since * means zero or more occurrences), the digit sits at the second position. So it matches aa1, aC5, a9, az8, abcdAbcd2 etc.

Here there might be a question: how does it match the string abcdAbcd2? The a at the first position matches the starting a of abcdAbcd2, and the 2 at the end is matched by [0-9]. So in between, bcdAbcd is matched by [a-zA-Z]*. Let’s split this into two parts. First, we have [a-zA-Z], which matches any letter. Second, we have *, which repeats whatever comes just before it. The important point is that * does not repeat the first character that [a-zA-Z] happened to match. That means if b is matched by [a-zA-Z], the * does not say to repeat b zero or more times. It says to repeat [a-zA-Z] zero or more times. So each time, the whole of [a-zA-Z] is available to match from: the first time b is matched, the second time c, the third time d, and so on.
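To see which part of each string the repeated [a-zA-Z] takes, here is a sketch in Python. The brackets around [a-zA-Z]* are an addition for this illustration only: they do not change what the regex matches, they just let us read back the middle part with group(1).

```python
import re

# Same regex as above, with ( ) added to capture the repeated letters.
pattern = re.compile(r"a([a-zA-Z]*)[0-9]")

results = {}
for s in ["aa1", "aC5", "a9", "az8", "abcdAbcd2", "a", "ab"]:
    m = pattern.fullmatch(s)
    # Store the letters taken by [a-zA-Z]*, or None when there is no match.
    results[s] = m.group(1) if m else None

print(results)
```

For abcdAbcd2 the captured middle is bcdAbcd, a mix of different letters, which shows that * repeats the whole bracket expression and not one particular letter. The last two strings fail because they have no digit at the end.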

So, now you can combine these pieces to build a basic regex. Try some variations on different texts and you will see how it works.

Here is a quick summary:

  • ^: (line) starts with
  • $: (line) ends with
  • *: zero or more (occurrences)
  • +: one or more (occurrences)
  • ?: zero or one (optional)
  • |: or
  • .: any single character
  • {n}: exactly n occurrences
  • {n,m}: min n to max m occurrences
  • []: matches any one character in between
  • a-z, A-Z, 0-9: range
  • [^ ]: matches any one character not in between

So, that’s all. Go on and try some cool regex!

Kirtan Thakkar
