Regular Expressions

An introduction to regular expressions, and how to use them in C++ with std::regex, std::regex_match, and std::regex_search
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Ryan McCombe

Regular expressions, often referred to as "regex" or "regexp", are powerful tools that allow for the search and manipulation of text. They are used to detect if a string meets specific requirements, to extract information from text, or to change text in specific ways.

Some examples where they’re useful include:

  • Validating user input - for example, ensuring a string looks like an email address, phone number, or another type of data we were expecting
  • Syntax highlighting - your code editor, or the code snippets displayed in our lessons, use regular expressions to change the color and formatting of characters to make our code more readable
  • Data extraction - for example, if we want to create a system that reads through text and extracts anything that looks like a date
  • Redaction - for example, automatically removing contents from a document that could be sensitive

In this lesson, we’ll introduce the syntax for creating a regular expression, and show how to use it to determine if our C++ string matches the given pattern. In the next lesson, we’ll expand this to cover how we can use a regex pattern to extract and replace parts of our content in a targeted way.

Regular expressions can be quite challenging to understand at first. Using them to look for complex patterns or solve larger tasks can be tough. However, working with text is ubiquitous in programming, and regular expressions are the tool we reach for when our task gets a little more complex.

Additionally, regular expression syntax adheres to broadly similar standards across all of programming - not just C++. As such, once we’re familiar with them, we can use them regardless of what programming language we’re working in.

Standalone Regex Tools

When we’re writing more complex regular expressions, it’s common to craft them with the help of a standalone tool. Those tools will explain what our expression is doing, and allow us to quickly test it against a range of inputs. Many free web-based tools are available, such as RegExr.

Once we’ve created our expression in such a tool, we can copy and paste it into our code. This tends to be much faster than trying to create it in our code editor from scratch.

Using Regular Expressions in C++

Regular expression functionality in the C++ library is available by including the <regex> header:

#include <regex>

Regular expressions, sometimes called patterns, are themselves written as strings. The standard way to create a regex pattern in C++ is through the std::basic_regex class template. For ordinary char-based strings, we use the std::regex alias, which is std::basic_regex<char>:

std::regex Pattern{"hello world"};

The two most common functions we have for running regular expressions are std::regex_match() and std::regex_search().

In their most basic usage, they accept two arguments - the string we want to test, and the regex we want to use:

#include <iostream>
#include <regex>
#include <string>

int main(){
  std::string Input{"hello world"};
  std::regex Pattern{"hello"};

  bool MatchResult{
    std::regex_match(Input, Pattern)};

  bool SearchResult{
    std::regex_search(Input, Pattern)};
}
  • std::regex_match() returns true if the entire input string matches the pattern
  • std::regex_search() returns true if part of the input string matches the pattern

Below, we run the regex pattern hello on the string hello world. The regex_match() call will return false because the pattern doesn’t match the entire string. The regex_search() call will return true because a substring that matches the pattern was found within the input:

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"hello world"};
  std::regex Pattern{"hello"};

  bool MatchResult{
      std::regex_match(Input, Pattern)};

  bool SearchResult{
      std::regex_search(Input, Pattern)};

  std::cout << "The regex_match pattern "
            << (MatchResult ? "did" : "did NOT")
            << " match";

  std::cout << "\nThe regex_search pattern "
            << (SearchResult ? "did"
                             : "did NOT")
            << " match";
}
The regex_match pattern did NOT match
The regex_search pattern did match

Case-Insensitive Regex Patterns

When creating our pattern using the std::regex constructor, we can provide a second argument. This is where we can pass flags that modify the behavior of our expression. Most of these flags are for advanced use cases that we won’t cover here.

However, there is one that is simple to understand and commonly useful: std::regex::icase flags our expression as being case insensitive. The following pattern searches for "hello", "HELLO", "Hello", "hElLo" and any other variation:

std::regex Pattern{"hello", std::regex::icase};

Below, we search for hello. Our string doesn’t contain hello, but it does have Hello:

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"Hello world"};
  std::regex SensitivePattern{"hello"};
  std::regex InsensitivePattern{
      "hello", std::regex::icase};

  bool SensitiveResult{std::regex_search(
      Input, SensitivePattern)};

  bool InsensitiveResult{std::regex_search(
      Input, InsensitivePattern)};

  std::cout << "The sensitive pattern "
            << (SensitiveResult ? "did"
                                : "did NOT")
            << " match";

  std::cout << "\nThe insensitive pattern "
            << (InsensitiveResult ? "did"
                                  : "did NOT")
            << " match";
}
The sensitive pattern did NOT match
The insensitive pattern did match

Raw String Literals

In the next section, we’ll begin to see more complicated regular expressions that include a lot of special characters. This particularly includes the backslash character, \, which needs to be escaped in standard C++ string literals.

These additional escape characters can make regular expressions even more difficult to follow. So, for more complex expressions, it is recommended to construct our patterns from raw string literals.

Whilst string literals begin and end with ", raw string literals begin with R"( and end with )":

// String Literal
std::regex PatternA{"hello"};

// Raw String Literal
std::regex PatternB{R"(hello)"};
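To make the difference concrete, here is a minimal sketch that writes the same pattern both ways. The function names are hypothetical, and the pattern uses symbols (\d, +, \. and {2}) that are explained later in this lesson. It looks for one or more digits, a literal period, and then two more digits:

```cpp
#include <regex>
#include <string>

// Standard string literal: every regex backslash must be doubled so
// that a single backslash survives into the pattern std::regex sees.
bool ContainsPriceEscaped(const std::string& Input) {
  static const std::regex Pattern{"\\d+\\.\\d{2}"};
  return std::regex_search(Input, Pattern);
}

// Raw string literal: the text between R"( and )" is used as-is,
// so the pattern reads exactly as it would in a regex tool.
bool ContainsPriceRaw(const std::string& Input) {
  static const std::regex Pattern{R"(\d+\.\d{2})"};
  return std::regex_search(Input, Pattern);
}
```

Both functions should behave identically. Only the way the pattern is spelled in our source code differs.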

Special Characters

With regular expressions, we’re not just restricted to simple character matching. We have the option to add more complex syntax into our pattern, to create more elaborate behavior.

Regex Order of Operations

Within the same pattern, we can combine as many of these special characters as we want. Like other programming expressions, how they combine is subject to an order of operations that is not entirely intuitive.

Few people learn the order of operations. Instead, we just use the dedicated regex tools talked about previously to see what works and what doesn’t.

In the next lesson, we’ll learn how to manipulate the order of operations within regex by using groups.

Below, we cover the most common special characters:

Wildcard: .

The period character . acts as a wildcard, matching any character within a string. For example, the regex pattern c.t will match any three-letter sequence that starts with "c" and ends with "t":

✔️ c.t -> cat
✔️ c.t -> cut
✔️ c.t -> cot

It only matches a single character, so a pattern like c.t will not match a string like cart:

❌ c.t -> cart
❌ c.t -> colt
❌ c.t -> carrot

But we can use multiple . tokens:

✔️ c..t -> cart
✔️ c..t -> colt
❌ c..t -> carrot
✔️ c....t -> carrot

We’ll see better ways of matching multiple characters later in this lesson.
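If we want to try the "pattern -> input" examples from this lesson in C++, a small helper along the following lines is one option. The name SearchMatches is hypothetical, and the sketch assumes substring semantics, so it uses std::regex_search():

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if any part of Input matches
// Pattern, mirroring the "pattern -> input" examples in this lesson.
bool SearchMatches(const std::string& Pattern,
                   const std::string& Input) {
  // Constructing a std::regex is relatively costly, so real code
  // would usually build the pattern once and reuse it.
  const std::regex Compiled{Pattern};
  return std::regex_search(Input, Compiled);
}
```

For example, SearchMatches("c.t", "cat") returns true, whilst SearchMatches("c.t", "cart") returns false.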

Alternation: |

The vertical bar character | allows the input to match either the left or the right pattern:

✔️ cat|hat -> cat
✔️ cat|hat -> hat
❌ cat|hat -> tat

✔️ cat|h.t -> hit

Optional: ?

The question mark character ? flags the previous symbol as being optional:

✔️ can?t -> cant
✔️ can?t -> cat

✔️ ca.?t -> cart
✔️ ca.?t -> cat

Start of input: ^

The caret symbol ^ denotes the start of the input, allowing us to match patterns that only occur at the beginning of our string:

✔️ ^cat -> cat
❌ ^cat -> the cat

✔️ ^cat|the -> the

✔️ ^.at -> mat
❌ ^.at -> the mat
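The ^cat|the example above is worth a closer look, because the ^ only applies to the left side of the alternation. The following sketch, using a hypothetical function name, shows the pattern being read as "cat at the start of the input, or the anywhere in the input":

```cpp
#include <regex>
#include <string>

// ^cat|the is read as "(cat at the start) OR (the anywhere)".
// The ^ anchor belongs to the left alternative only.
bool StartsWithCatOrHasThe(const std::string& Input) {
  static const std::regex Pattern{R"(^cat|the)"};
  return std::regex_search(Input, Pattern);
}
```

With this pattern, "cat nap" and "see the cat" are both matched, but "a cat" is not: its cat is not at the start, and it contains no the.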

End of input: $

Conversely, the dollar symbol $ denotes the end of the input, which allows us to restrict our search just to the end of the string:

✔️ cat$ -> cat
❌ cat$ -> the cat sat

✔️ cat|mat$ -> the cat
✔️ cat|mat$ -> sat on the mat

✔️ ca.$ -> the can

Earlier, we noted the C++ standard library has the regex_match() function, which matches the entire input, whilst regex_search() looks for substrings. Many programming languages don’t draw a distinction here - they just offer the equivalent of regex_search().

However, using the ^ and $ symbols together allows us to create a pattern that only matches against the entire input anyway:

✔️ ^cat$ -> cat
❌ ^cat$ -> cats
❌ ^cat$ -> the cat

✔️ ^.at$ -> cat
✔️ ^.at$ -> mat
❌ ^.at$ -> mate
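As a rough illustration of that equivalence in C++, the sketch below checks the same thing in two ways: an anchored pattern with std::regex_search(), and an unanchored pattern with std::regex_match(). The function names are hypothetical:

```cpp
#include <regex>
#include <string>

// Substring search, but the pattern itself is anchored at both ends.
bool IsAtWordAnchored(const std::string& Input) {
  static const std::regex Pattern{R"(^.at$)"};
  return std::regex_search(Input, Pattern);
}

// The same check using regex_match(), which always requires the
// whole input to match, so no anchors are needed.
bool IsAtWordFullMatch(const std::string& Input) {
  static const std::regex Pattern{R"(.at)"};
  return std::regex_match(Input, Pattern);
}
```

For single-line inputs like cat, mate, and the cat, both functions should agree.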

Escaping Special Characters

Often, we want to use one of the special characters as their literal meaning within our patterns. For example, imagine we want to check for a literal period within our input. However, . denotes a wild card, so adding it to our regex will match any character.

We can escape characters using the backslash symbol: \. So, if we wanted to check for a literal ., our regex would use \.

Unescaped . is a wild card
✔️ sat on the mat. -> sat on the mat.
✔️ sat on the mat. -> sat on the mat and
✔️ sat on the mat. -> sat on the mate

Escape a special character using \
✔️ sat on the mat\. -> sat on the mat.
❌ sat on the mat\. -> sat on the mate
❌ sat on the mat\. -> sat on the mat

Unescaped ? is the optional symbol
✔️ hello? -> hello
✔️ hello? -> hell

Escape it using \
❌ hello\? -> hello
❌ hello\? -> hell
✔️ hello\? -> hello?

Unescaped $ denotes end of input
❌ $3.50 -> $3.50

Unescaped . is a wild card
✔️ \$3.50 -> $3.50
✔️ \$3.50 -> $3-50

Escaping both special characters
✔️ \$3\.50 -> $3.50
❌ \$3\.50 -> $3-50

When we want to search for a literal \ in our input, the escape character itself can be escaped with an additional \:

❌ yes\no -> yes\no
✔️ yes\\no -> yes\no

To match an input that starts with a literal \\ we escape both
❌ ^\\user\\files -> \\user\files
✔️ ^\\\\user\\files -> \\user\files
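In C++, escaped patterns like these are easiest to read as raw string literals, because the regex backslashes don’t need to be doubled again for the compiler. A minimal sketch, with hypothetical function names:

```cpp
#include <regex>
#include <string>

// Looks for the literal text $3.50. Both $ and . are escaped so
// they lose their special meaning.
bool ContainsExactPrice(const std::string& Input) {
  static const std::regex Pattern{R"(\$3\.50)"};
  return std::regex_search(Input, Pattern);
}

// Looks for the literal text \\user\files. Each literal backslash
// in the input needs two backslashes in the pattern.
bool ContainsSharePath(const std::string& Input) {
  static const std::regex Pattern{R"(\\\\user\\files)"};
  return std::regex_search(Input, Pattern);
}
```

Had we used standard string literals instead, the second pattern would need eight backslashes at the start rather than four.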

Character Sets / Character Classes

When we want to search for one of a range of possible characters, we can introduce a character set, sometimes also called a character class. We do this by wrapping our characters in [ and ]. The following searches for bat, cat, mat or rat:

std::regex Pattern{R"([bcmr]at)"};

The order of characters within the set doesn’t matter.

Character sets interact with surrounding special characters as expected. For example, we can check if our input starts with something in a character set using the start-of-input symbol ^, or make the entire set optional using the optional symbol ?:

✔️ ^[Tt]he -> the cat
✔️ ^[Tt]he -> The cat
❌ ^[Tt]he -> Not the cat

✔️ [cbm]?at -> at
✔️ [cbm]?at -> cat
✔️ [cbm]?at -> bat
✔️ [cbm]?at -> mat

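Within the [ and ] boundary of a character set, the special characters ., ?, ^, $, and | revert to their literal values. The exception is a ^ placed at the very start of the set, which negates it, as covered later in this lesson. For example, the period symbol . matches only a literal . in the input: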

✔️ cat[.] -> cat.
❌ cat[.] -> cats

Character Set Ranges

We can specify numeric or alphabetic ranges within our character sets using -. For example, [a-e] will match any of a, b, c, d, or e:

✔️ [a-h]am -> cam
✔️ [a-h]am -> ham
❌ [a-h]am -> ram

✔️ [a-z]am -> ram
❌ [a-z]am -> 9am

❌ [0-9]am -> ram
✔️ [0-9]am -> 9am

✔️ [0-9a-z]am -> 9am
✔️ [0-9a-z]am -> ram

Hexadecimal value
✔️ [A-F0-9][A-F0-9] -> FF
✔️ [A-F0-9][A-F0-9] -> E4
❌ [A-F0-9][A-F0-9] -> G4
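As a sketch of how the hexadecimal example might look in C++, the hypothetical function below uses std::regex_match() so that the whole input must be exactly two uppercase hexadecimal digits:

```cpp
#include <regex>
#include <string>

// True only if Input is exactly two characters, each of which is
// 0-9 or an uppercase A-F.
bool IsHexPair(const std::string& Input) {
  static const std::regex Pattern{R"([A-F0-9][A-F0-9])"};
  return std::regex_match(Input, Pattern);
}
```

Because the ranges are case sensitive, an input like ff is rejected. We could accept it by adding a-f to each set, or by constructing the pattern with std::regex::icase.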

Some character sets have shortcut symbols we can use instead:

  • \d for any numeric digit, equivalent to [0-9]
  • \w for any alphabetic character, digit, or underscore, equivalent to [a-zA-Z0-9_]
  • \s for any white space (can be a space character, a line break character, a tab character, and so on)
\d can be any digit
✔️ \d -> 9
❌ \d -> m
✔️ \dam -> 9am
❌ \dam -> dam
❌ \dam -> ram

Any two digits
✔️ \d\d -> 10
❌ \d\d -> 1

Making the second digit optional
✔️ \d\d? -> 10
✔️ \d\d? -> 1

\w can be any letter, digit, or underscore
✔️ \w -> m
✔️ \w -> 9
✔️ \wam -> ram
✔️ \w\wam -> roam
✔️ help\w -> helps
❌ help\w -> help!

\s can be any whitespace
✔️ the\sfox -> the fox

The \n here is a line break
✔️ the\sfox -> the\nfox

Any whitespace followed by any letter
✔️ the\s\wat -> the cat
✔️ the\s\wat -> the mat
❌ the\s\wat -> themat

Making whitespace optional
✔️ the\s?\wat -> themat

Combining \d \s and \w (matching the entire input)
❌ \d\d\s\wats -> 1 cat
❌ \d\d\s\wats -> 5 cats
✔️ \d\d\s\wats -> 05 cats
✔️ \d\d\s\wats -> 24 rats
✔️ \d\d\s\wats -> 24\nrats
❌ \d\d\s\wats -> 100 cats
❌ \d\d\s\wats -> four cats

Making the second \d and the final s optional (matching the entire input)
✔️ \d\d?\s\wats? -> 1 cat
✔️ \d\d?\s\wats? -> 5 cats
✔️ \d\d?\s\wats? -> 05 cats
✔️ \d\d?\s\wats? -> 24 rats
✔️ \d\d?\s\wats? -> 24\nrats
❌ \d\d?\s\wats? -> 100 cats
❌ \d\d?\s\wats? -> four cats
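Whether an input like 100 cats counts as a match depends on whether we require the whole input to match. The sketch below, using hypothetical function names, runs the last pattern through both std::regex_match() and std::regex_search() to show the difference:

```cpp
#include <regex>
#include <string>

// One or two digits, one whitespace character, any word character,
// then "at" with an optional trailing "s".
const std::regex& AnimalCountPattern() {
  static const std::regex Pattern{R"(\d\d?\s\wats?)"};
  return Pattern;
}

// The entire input must match the pattern.
bool IsAnimalCount(const std::string& Input) {
  return std::regex_match(Input, AnimalCountPattern());
}

// Any substring of the input may match the pattern.
bool ContainsAnimalCount(const std::string& Input) {
  return std::regex_search(Input, AnimalCountPattern());
}
```

IsAnimalCount("100 cats") is false because the full input has three digits, but ContainsAnimalCount("100 cats") is true because the substring 00 cats fits the pattern.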

Escaping Character Sets

Character sets, and their shortcuts, can be escaped in the usual way, with \. For example, if we wanted our regex to search for the [ character, we’d escape it as \[

Searching for the literal [hello]
❌ \[hello\] -> h
✔️ \[hello\] -> [hello]

Searching for literal \w
❌ \\w -> a
✔️ \\w -> \w

Searching for literal \d
❌ \\d -> 5
✔️ \\d -> \d

Searching for literal \s
❌ \\s -> the cat
✔️ \\s -> \s

Negating Character Sets

By including the caret symbol ^ at the beginning of our character set, we can negate it. A negated set matches any single character that is not in the set. Note that it still needs a character to match, so it does not succeed where the input has nothing at that position.

Searching for "at" preceded by a character other than c, b or m
❌ [^cbm]at -> at
❌ [^cbm]at -> cat
❌ [^cbm]at -> bat
❌ [^cbm]at -> mat
✔️ [^cbm]at -> rat

Searching for "at" not preceded by anything from c-m
✔️ [^c-m]at -> rat

Searching for "at" not preceded by a digit
✔️ [^\d]at -> cat
❌ [^\d]at -> 5at
✔️ [^\d]at -> 5 at

Searching for "cat" followed by a character that is not a letter, digit, or underscore
❌ cat[^\w] -> cat
✔️ cat[^\w] -> the cat sat
❌ cat[^\w] -> caterpillar
❌ cat[^\w] -> vacate
✔️ cat[^\w] -> copycat!

Searching for "cat" both preceded and followed by such a character
❌ [^\w]cat[^\w] -> copycat!
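A point that is easy to miss is that a negated set still has to consume one character. The sketch below uses a different, hypothetical example to illustrate this: the pattern q[^u] looks for a q followed by any character other than u.

```cpp
#include <regex>
#include <string>

// Matches a "q" followed by one character that is not "u".
// The negated set must match a real character, so a "q" at the
// very end of the input does not satisfy the pattern.
bool HasQThenNonU(const std::string& Input) {
  static const std::regex Pattern{R"(q[^u])"};
  return std::regex_search(Input, Pattern);
}
```

Here, "Iraq is" matches because the q is followed by a space, but "Iraq" on its own does not, as nothing follows the q.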

Repetition

We can look for repeating patterns within our input. We do that by adding syntax directly after the symbol or character set we want to look for repetitions of. We have several options:

Zero or More: *

The * character states there can be any number of the preceding symbol or character set. That can include zero:

✔️ ab*c -> ac
✔️ ab*c -> abc
✔️ ab*c -> abbc
✔️ ab*c -> abbbbbc

✔️ a.*c -> ac
✔️ a.*c -> abc
✔️ a.*c -> a123c
✔️ a.*c -> a123 abc

✔️ a[bcd]*e -> ae
✔️ a[bcd]*e -> abe
✔️ a[bcd]*e -> ace
✔️ a[bcd]*e -> abcde
✔️ a[bcd]*e -> abcdcdbe
❌ a[bcd]*e -> a1e

One or More: +

The + character specifies we want at least one of the preceding symbol or character set, but there can be more:

❌ ab+c -> ac
✔️ ab+c -> abc
✔️ ab+c -> abbc
✔️ ab+c -> abbbbbc

❌ a.+c -> ac
✔️ a.+c -> abc
✔️ a.+c -> a123c
✔️ a.+c -> a123 abc

❌ a[bcd]+e -> ae
✔️ a[bcd]+e -> abe
✔️ a[bcd]+e -> ace
✔️ a[bcd]+e -> abcde
✔️ a[bcd]+e -> abcdcdbe
❌ a[bcd]+e -> a1e

Exactly x repetitions: {x}

The brace syntax allows us to be more specific with how many repetitions we want. We can pass a single number between the braces, to specify we want a specific number of repetitions.

In the following example, we look for exactly two repetitions:

❌ ab{2}c -> ac
❌ ab{2}c -> abc
✔️ ab{2}c -> abbc
❌ ab{2}c -> abbbc

❌ a.{2}c -> ac
❌ a.{2}c -> abc
✔️ a.{2}c -> a12c
❌ a.{2}c -> a123c

❌ a[bcd]{2}e -> ae
❌ a[bcd]{2}e -> abe
❌ a[bcd]{2}e -> ace
✔️ a[bcd]{2}e -> abce
✔️ a[bcd]{2}e -> acde
❌ a[bcd]{2}e -> abcde

Two hexadecimal digits (one byte, 256 possible values)
✔️ [A-F0-9]{2} -> 6F

hexadecimal color
✔️ [A-F0-9]{6} -> 6F4AFF

Note that a common mistake here comes when using this syntax alongside a substring search, like std::regex_search(). In that context, a pattern like [0-9]{2}, which searches for exactly 2 digits, will return true on an input like 123. Whilst the entire string 123 has 3 digits, it has two substrings of 2 digits - 12 and 23.

If we wanted this input to not match, we’d need to be more specific. For example, if we wanted our entire string to be exactly 2 digits, we could use std::regex_match(), instead of std::regex_search(), or add the start and end of input symbols ^ and $ to our regex.
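The three options described above can be sketched as follows. The function names are hypothetical:

```cpp
#include <regex>
#include <string>

// Substring search: true if two adjacent digits appear anywhere.
bool ContainsTwoDigits(const std::string& Input) {
  static const std::regex Pattern{R"([0-9]{2})"};
  return std::regex_search(Input, Pattern);
}

// Full match: true only if the input is exactly two digits.
bool IsTwoDigits(const std::string& Input) {
  static const std::regex Pattern{R"([0-9]{2})"};
  return std::regex_match(Input, Pattern);
}

// Substring search with anchors: behaves like the full match.
bool IsTwoDigitsAnchored(const std::string& Input) {
  static const std::regex Pattern{R"(^[0-9]{2}$)"};
  return std::regex_search(Input, Pattern);
}
```

For the input 123, only ContainsTwoDigits() returns true. For the input 12, all three return true.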

At least x repetitions: {x,}

By adding a trailing comma within our braces, we specify that we want at least x repetitions, without an upper limit. Below, we search for at least two repetitions:

❌ ab{2,}c -> ac
❌ ab{2,}c -> abc
✔️ ab{2,}c -> abbc
✔️ ab{2,}c -> abbbc

❌ a.{2,}c -> ac
❌ a.{2,}c -> abc
✔️ a.{2,}c -> a12c
✔️ a.{2,}c -> a123c

❌ a[bcd]{2,}e -> ae
❌ a[bcd]{2,}e -> abe
❌ a[bcd]{2,}e -> ace
✔️ a[bcd]{2,}e -> abce
✔️ a[bcd]{2,}e -> acde
✔️ a[bcd]{2,}e -> abcde

From x to y repetitions: {x,y}

By adding a second number to our braces, we can specify both a lower and upper range for the number of repetitions we are looking for. Below, we search for at least one, but not more than three repetitions of our previous symbol or character set:

Looking for 1 to 3 repetitions
❌ ab{1,3}c -> ac
✔️ ab{1,3}c -> abc
✔️ ab{1,3}c -> abbc
✔️ ab{1,3}c -> abbbc
❌ ab{1,3}c -> abbbbc

❌ a.{1,3}c -> ac
✔️ a.{1,3}c -> abc
✔️ a.{1,3}c -> a12c
✔️ a.{1,3}c -> a123c
❌ a.{1,3}c -> a1234c

❌ a[bcd]{1,3}e -> ae
✔️ a[bcd]{1,3}e -> abe
✔️ a[bcd]{1,3}e -> ace
✔️ a[bcd]{1,3}e -> abce
✔️ a[bcd]{1,3}e -> acde
✔️ a[bcd]{1,3}e -> abcde
❌ a[bcd]{1,3}e -> abbcde

Repetition with Character Set Shortcuts

The repetition specifiers also work with the character set shortcuts \s, \w, and \d:

❌ the\s+cat -> thecat
✔️ the\s+cat -> the cat
✔️ the\s+cat -> the    cat

✔️ c\w*t -> ct
✔️ c\w*t -> cat
✔️ c\w*t -> coat
✔️ c\w*t -> c5t
❌ c\w*t -> c-t

❌ \d{2} -> 1
✔️ \d{2} -> 10

Substring match
✔️ \d{2} -> 100

Full string match
❌ ^\d{2}$ -> 100

❌ \d{2,} -> 1
✔️ \d{2,} -> 10
✔️ \d{2,} -> 100

✔️ \d{1,3} -> 1
✔️ \d{1,3} -> 10
✔️ \d{1,3} -> 100

Substring match
✔️ \d{1,3} -> 1000

Full string match
❌ ^\d{1,3}$ -> 1000
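Putting these pieces together, the sketch below checks whether an input has the shape of a date written as four digits, two digits, and two digits, separated by hyphens. The function name and the choice of date format are assumptions made for illustration:

```cpp
#include <regex>
#include <string>

// True if the entire input looks like NNNN-NN-NN, where N is a
// digit. Outside of a character set, the hyphen has no special
// meaning, so it does not need to be escaped.
bool LooksLikeDate(const std::string& Input) {
  static const std::regex Pattern{R"(\d{4}-\d{2}-\d{2})"};
  return std::regex_match(Input, Pattern);
}
```

This only validates the shape of the text. An input like 2024-99-99 is still accepted, so checking that the numbers form a real date would need additional logic.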

In the next lesson, we’ll cover regular expression capture groups. These build on the regex syntax we learned but will allow us to go beyond just checking whether or not the text matches the regular expression pattern we created.

With capture groups, we will learn how to use regex to filter through and extract parts of our text, or to replace segments of it in a targeted way.

Summary

In this lesson, we've explored the fundamentals of using regular expressions in C++, covering how to create patterns, search and match strings, and apply modifiers for case-insensitivity.

Main Points Learned:

  • Introduction to regular expressions and their applications.
  • Usage of the <regex> header and std::regex for creating regular expression patterns in C++.
  • The difference between std::regex_match() and std::regex_search(), and examples of their basic usage.
  • How to make regex patterns case-insensitive using std::regex::icase.
  • Utilizing raw string literals for creating complex regex patterns without excessive escaping.
  • Understanding and applying special characters in regex, such as the wildcard (.), alternation (|), and optional (?) characters.
  • Using anchors (^ for start of input, $ for end of input) to specify where a pattern should match.
  • Escaping special characters in regex patterns to use their literal meaning.
  • Defining and using character sets and ranges within regex patterns to match specific groups of characters.
  • Employing repetition specifiers (*, +, {x}, {x,}, and {x,y}) to match repeating patterns.
  • The significance of escaping character sets and shortcuts (\d, \w, \s) in regex, and how to negate character sets with ^.

Next Lesson

Regex Capture Groups

An introduction to regular expression capture groups, and how to use them in C++ with regex search, replace, iterator, and token_iterator
Copyright © 2024 - All Rights Reserved