std::regex, std::regex_match, and std::regex_search
Regular expressions, often referred to as "regex" or "regexp", are powerful tools that allow for the search and manipulation of text. They are used to detect if a string meets specific requirements, to extract information from text, or to change text in specific ways.
Some examples where they’re useful include:
In this lesson, we’ll introduce the syntax for creating a regular expression, and show how to use it to determine if our C++ string matches the given pattern. In the next lesson, we’ll expand this to cover how we can use a regex pattern to extract and replace parts of our content in a targeted way.
Regular expressions can be quite challenging to understand at first. Using them to look for complex patterns or solve larger tasks can be tough. However, working with text is ubiquitous in programming, and regular expressions are the tool we reach for when our task gets a little more complex.
Additionally, regular expression syntax adheres to broadly similar standards across all of programming - not just C++. As such, once we’re familiar with them, we can use them regardless of what programming language we’re working in.
When we’re writing more complex regular expressions, it’s common to craft them with the help of a standalone tool. Those tools will explain what our expression is doing, and allow us to quickly test it against a range of inputs. Many free web-based tools are available, such as RegExr.
Once we’ve created our expression in such a tool, we can copy and paste it into our code. This tends to be much faster than trying to create it in our code editor from scratch.
Regular expression functionality in the C++ library is available by including the <regex>
 header:
#include <regex>
Regular expressions, sometimes called patterns, are written as strings. The standard way to create a regex pattern in C++ is through the std::basic_regex class template. For char-based strings, we use the alias std::regex, which is std::basic_regex<char>:
std::regex Pattern{"hello world"};
The two most common functions we have for running regular expressions are std::regex_match()
and std::regex_search()
.
In their most basic usage, they accept two arguments - the string we want to test, and the regex we want to use:
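#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"hello world"};
  std::regex Pattern{"hello"};

  // Does the entire input match the pattern?
  bool MatchResult{
    std::regex_match(Input, Pattern)};

  // Does any part of the input match the pattern?
  bool SearchResult{
    std::regex_search(Input, Pattern)};
}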
std::regex_match() returns true if the entire input string matches the pattern.
std::regex_search() returns true if any part of the input string matches the pattern.
Below, we run the regex pattern hello on the string hello world. The regex_match() call will return false because the pattern doesn’t match the entire string. The regex_search() call will return true because a substring of the input matches the pattern:
#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"hello world"};
  std::regex Pattern{"hello"};

  // Whole-string match vs substring search
  bool MatchResult{
    std::regex_match(Input, Pattern)};
  bool SearchResult{
    std::regex_search(Input, Pattern)};

  std::cout << "The regex_match pattern "
    << (MatchResult ? "did" : "did NOT")
    << " match";
  std::cout << "\nThe regex_search pattern "
    << (SearchResult ? "did" : "did NOT")
    << " match";
}
The regex_match pattern did NOT match
The regex_search pattern did match
When creating our pattern using the std::regex
constructor, we can provide a second argument. This is where we can pass flags that modify the behavior of our expression. Most of these flags are for advanced use cases that we won’t cover here.
However, there is one that is simple to understand and commonly useful: std::regex::icase
flags our expression as being case insensitive. The following pattern searches for "hello", "HELLO", "Hello", "hElLo" and any other variation:
std::regex Pattern{"hello", std::regex::icase};
Below, we search for hello
. Our string doesn’t contain hello
, but it does have Hello
:
#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"Hello world"};

  // Same pattern text, with and without the icase flag
  std::regex SensitivePattern{"hello"};
  std::regex InsensitivePattern{
    "hello", std::regex::icase};

  bool SensitiveResult{std::regex_search(
    Input, SensitivePattern)};
  bool InsensitiveResult{std::regex_search(
    Input, InsensitivePattern)};

  std::cout << "The sensitive pattern "
    << (SensitiveResult ? "did" : "did NOT")
    << " match";
  std::cout << "\nThe insensitive pattern "
    << (InsensitiveResult ? "did" : "did NOT")
    << " match";
}
The sensitive pattern did NOT match
The insensitive pattern did match
In the next section, we’ll begin to see more complicated regular expressions that include a lot of special characters. In particular, they often include the backslash character, \, which needs to be escaped as \\ in standard C++ string literals.
These additional escape characters can make regular expressions even more difficult to follow. So, for more complex expressions, it is recommended to construct our patterns from raw string literals.
Whilst string literals begin and end with "
, raw string literals begin with R"(
and end with )"
:
// String Literal
std::regex PatternA{"hello"};
// Raw String Literal
std::regex PatternB{R"(hello)"};
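As a rough illustration of why raw string literals help, the sketch below writes the same two-digit pattern both ways. It uses the \d shortcut, which is covered later in this lesson, and the helper names are hypothetical, chosen only for this example:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the pattern is a standard string literal,
// so every backslash meant for the regex has to be doubled
bool IsTwoDigitsEscaped(const std::string& Input) {
  static const std::regex Pattern{"\\d\\d"};
  return std::regex_match(Input, Pattern);
}

// Hypothetical helper: the same pattern as a raw string literal,
// where the backslashes reach the regex engine unchanged
bool IsTwoDigitsRaw(const std::string& Input) {
  static const std::regex Pattern{R"(\d\d)"};
  return std::regex_match(Input, Pattern);
}
```

Both functions should behave identically, because "\\d\\d" and R"(\d\d)" are the same four characters once the compiler has processed them.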
With regular expressions, we’re not just restricted to simple character matching. We have the option to add more complex syntax into our pattern, to create more elaborate behavior.
Within the same pattern, we can combine as many of these special characters as we want. Like other programming expressions, how they combine is subject to an order of operations that is not entirely intuitive.
Few people learn the order of operations. Instead, we just use the dedicated regex tools talked about previously to see what works and what doesn’t.
In the next lesson, we’ll learn how to manipulate the order of operations within regex by using groups.
Below, we cover the most common special characters:
.
The period character . acts as a wildcard, matching any single character (in the default ECMAScript grammar, any character except a line break). For example, the regex pattern c.t will match any three-character sequence that starts with "c" and ends with "t":
✔️ c.t -> cat
✔️ c.t -> cut
✔️ c.t -> cot
It only matches a single character, so a pattern like c.t
will not match a string like cart
 :
❌ c.t -> cart
❌ c.t -> colt
❌ c.t -> carrot
But we can use multiple .
 tokens:
✔️ c..t -> cart
✔️ c..t -> colt
❌ c..t -> carrot
✔️ c....t -> carrot
We’ll see better ways of matching multiple characters later in this lesson.
|
The vertical bar character, |
allows the input to match either the left or the right pattern:
✔️ cat|hat -> cat
✔️ cat|hat -> hat
❌ cat|hat -> tat
✔️ cat|h.t -> hit
?
The question mark character, ?
flags the previous symbol as being optional:
✔️ can?t -> cant
✔️ can?t -> cat
✔️ ca.?t -> cart
✔️ ca.?t -> cat
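The wildcard, alternation, and optional examples above can be checked from C++ with a small helper. This is a minimal sketch: FullyMatches is a hypothetical name, not part of the standard library, and it uses std::regex_match() so the whole input has to match:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of Input matches
// PatternText, using the default ECMAScript grammar
bool FullyMatches(const std::string& Input,
                  const std::string& PatternText) {
  std::regex Pattern{PatternText};
  return std::regex_match(Input, Pattern);
}
```

For example, FullyMatches("cut", "c.t") and FullyMatches("hit", "cat|h.t") should return true, while FullyMatches("cart", "c.t") should return false. Constructing a std::regex on every call is relatively expensive, so real code would usually build the pattern once and reuse it.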
^
The caret symbol ^ denotes the start of the input, allowing us to match patterns only when they appear at the beginning of our string:
✔️ ^cat -> cat
❌ ^cat -> the cat
✔️ ^cat|the -> the
✔️ ^.at -> mat
❌ ^.at -> the mat
$
Conversely, the dollar symbol $
denotes the end of the input, which allows us to restrict our search just to the end of the string:
✔️ cat$ -> cat
❌ cat$ -> the cat sat
✔️ cat|mat$ -> the cat
✔️ cat|mat$ -> sat on the mat
✔️ ca.$ -> the can
Earlier, we noted the C++ standard library has the regex_match()
function, which matches the entire input, whilst regex_search()
looks for substrings. Many programming languages don’t draw a distinction here - they just offer the equivalent of regex_search()
.
However, using the ^ and $ symbols together allows us to create a pattern that only matches when it covers the entire input:
✔️ ^cat$ -> cat
❌ ^cat$ -> cats
❌ ^cat$ -> the cat
✔️ ^.at$ -> cat
✔️ ^.at$ -> mat
❌ ^.at$ -> mate
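To make the relationship between the anchors and the two C++ functions concrete, here is a small sketch. The function names are hypothetical, and the comparison is only claimed for this simple pattern:

```cpp
#include <regex>
#include <string>

// Substring search, but the pattern itself is anchored
// to the start and end of the input
bool SearchAnchored(const std::string& Input) {
  static const std::regex Pattern{"^cat$"};
  return std::regex_search(Input, Pattern);
}

// Whole-string match with an unanchored pattern
bool MatchUnanchored(const std::string& Input) {
  static const std::regex Pattern{"cat"};
  return std::regex_match(Input, Pattern);
}
```

For inputs such as cat, cats, and the cat, both functions should give the same answer: only cat is accepted.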
Often, we want to use one of the special characters for its literal meaning within our patterns. For example, imagine we want to check for a literal period within our input. Because . denotes a wildcard, adding it to our regex unescaped will match any character.
We can escape characters using the backslash symbol: \
. So, if we wanted to check for a literal .
, our regex would use \.
Unescaped . is a wild card
✔️ sat on the mat. -> sat on the mat.
✔️ sat on the mat. -> sat on the mat and
✔️ sat on the mat. -> sat on the mate
Escape a special character using \
✔️ sat on the mat\. -> sat on the mat.
❌ sat on the mat\. -> sat on the mate
❌ sat on the mat\. -> sat on the mat
Unescaped ? is the optional symbol
✔️ hello? -> hello
✔️ hello? -> hell
Escape it using \
❌ hello\? -> hello
❌ hello\? -> hell
✔️ hello\? -> hello?
Unescaped $ denotes end of input
❌ $3.50 -> $3.50
Unescaped . is a wild card
✔️ \$3.50 -> $3.50
✔️ \$3.50 -> $3-50
Escaping both special characters
✔️ \$3\.50 -> $3.50
❌ \$3\.50 -> $3-50
When we want to search for a literal \
in our input, the escape character itself can be escaped with an additional \
:
❌ yes\no -> yes\no
✔️ yes\\no -> yes\no
To search for literal \\ we escape both
❌ \\user\\files -> \\user\files
✔️ \\\\user\\files -> \\user\files
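In C++ source code, these escaped patterns are easiest to write as raw string literals. The sketch below uses two hypothetical helper functions to search for a literal price and a literal backslash:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does Input contain the literal text $3.50?
bool ContainsPrice(const std::string& Input) {
  // \$ and \. are passed to the regex engine unchanged
  static const std::regex Pattern{R"(\$3\.50)"};
  return std::regex_search(Input, Pattern);
}

// Hypothetical helper: does Input contain yes\no,
// with a literal backslash between the words?
bool ContainsYesBackslashNo(const std::string& Input) {
  // The regex needs \\ to mean one literal backslash. As a
  // standard string literal this would be "yes\\\\no"
  static const std::regex Pattern{R"(yes\\no)"};
  return std::regex_search(Input, Pattern);
}
```

Note that the input string has its own escaping rules too: to pass the text yes\no from C++ code, we would write either "yes\\no" or R"(yes\no)".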
When we want to search for one of several possible characters, we can introduce a character set, sometimes also called a character class. We do this by wrapping our characters in [ and ]. The following searches for bat
, cat
, mat
or rat
:
std::regex Pattern{R"([bcmr]at)"};
The order of characters within the set doesn’t matter.
Character sets interact with surrounding special characters as expected. For example, we can check if our input starts with something in a character set using the start-of-input symbol ^
, or make the entire set optional using the optional symbol ?
:
✔️ ^[Tt]he -> the cat
✔️ ^[Tt]he -> The cat
❌ ^[Tt]he -> Not the cat
✔️ [cbm]?at -> at
✔️ [cbm]?at -> cat
✔️ [cbm]?at -> bat
✔️ [cbm]?at -> mat
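Within the [ and ] boundary of a character set, the special characters ., ?, $, and | revert to their literal values. The caret ^ is also literal unless it is the first character in the set, where it has a special meaning that we cover later in this lesson. For example, the period symbol . matches only a literal . in the input: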
✔️ cat[.] -> cat.
❌ cat[.] -> cats
We can specify numeric or alphabetic ranges within our character sets using -
. For example, [a-e]
will match any of a
, b
, c
, d
, or e
:
✔️ [a-h]am -> cam
✔️ [a-h]am -> ham
❌ [a-h]am -> ram
✔️ [a-z]am -> ram
❌ [a-z]am -> 9am
❌ [0-9]am -> ram
✔️ [0-9]am -> 9am
✔️ [0-9a-z]am -> 9am
✔️ [0-9a-z]am -> ram
Hexadecimal value
✔️ [A-F0-9][A-F0-9] -> FF
✔️ [A-F0-9][A-F0-9] -> E4
❌ [A-F0-9][A-F0-9] -> G4
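The hexadecimal example above could be used from C++ roughly as follows. IsHexByte is a hypothetical name, and the sketch assumes we only want to accept uppercase digits, as in the pattern from the table:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if Input is exactly two
// uppercase hexadecimal digits, such as FF or E4
bool IsHexByte(const std::string& Input) {
  static const std::regex Pattern{"[A-F0-9][A-F0-9]"};
  return std::regex_match(Input, Pattern);
}
```

Because the set only lists A-F, a lowercase input such as ff is rejected. Adding a-f to the set would be one way to accept it.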
Some character sets have shortcut symbols we can use instead:
\d for any numeric digit, equivalent to [0-9]
\w for any alphabetic character, digit, or underscore, equivalent to [a-zA-Z0-9_]
\s for any white space (can be a space character, a line break character, a tab character, and so on)
\d can be any digit
✔️ \d -> 9
❌ \d -> m
✔️ \dam -> 9am
❌ \dam -> dam
❌ \dam -> ram
Any two digits
✔️ \d\d -> 10
❌ \d\d -> 1
Making the second digit optional
✔️ \d\d? -> 10
✔️ \d\d? -> 1
\w can be any letter, digit, or underscore
✔️ \w -> m
✔️ \w -> 9
✔️ \wam -> ram
✔️ \w\wam -> roam
✔️ help\w -> helps
❌ help\w -> help!
\s can be any whitespace
✔️ the\sfox -> the fox
The \n here is a line break
✔️ the\sfox -> the\nfox
Any whitespace followed by any letter
✔️ the\s\wat -> the cat
✔️ the\s\wat -> the mat
❌ the\s\wat -> themat
Making whitespace optional
✔️ the\s?\wat -> themat
Combining \d \s and \w
❌ \d\d\s\wats -> 1 cat
❌ \d\d\s\wats -> 5 cats
✔️ \d\d\s\wats -> 05 cats
✔️ \d\d\s\wats -> 24 rats
✔️ \d\d\s\wats -> 24\nrats
❌ \d\d\s\wats -> 100 cats
❌ \d\d\s\wats -> four cats
Making the second \d and the final s optional
✔️ \d\d?\s\wats? -> 1 cat
✔️ \d\d?\s\wats? -> 5 cats
✔️ \d\d?\s\wats? -> 05 cats
✔️ \d\d?\s\wats? -> 24 rats
✔️ \d\d?\s\wats? -> 24\nrats
❌ \d\d?\s\wats? -> 100 cats
❌ \d\d?\s\wats? -> four cats
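Whether an input like 100 cats is accepted depends on which function we call. The following sketch runs the last pattern from the examples above through both std::regex_match() and std::regex_search(); the function names are hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the whole input must be one or two digits,
// a whitespace character, a word character, "at", and an optional "s"
bool IsAnimalCount(const std::string& Input) {
  static const std::regex Pattern{R"(\d\d?\s\wats?)"};
  return std::regex_match(Input, Pattern);
}

// Hypothetical helper: the same pattern, but any substring may match
bool ContainsAnimalCount(const std::string& Input) {
  static const std::regex Pattern{R"(\d\d?\s\wats?)"};
  return std::regex_search(Input, Pattern);
}
```

IsAnimalCount("100 cats") should return false because the whole input is not a match, whereas ContainsAnimalCount("100 cats") should return true because the substring 00 cats matches.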
Character sets, and their shortcuts, can be escaped in the usual way, with \
. For example, if we wanted our regex to search for the [
character, we’d escape it as \[
Searching for the literal [hello]
❌ \[hello\] -> h
✔️ \[hello\] -> [hello]
Searching for literal \w
❌ \\w -> a
✔️ \\w -> \w
Searching for literal \d
❌ \\d -> 5
✔️ \\d -> \d
Searching for literal \s
❌ \\s -> the cat
✔️ \\s -> \s
By including the caret symbol ^ at the beginning of our character set, we can negate it. A negated set matches any single character that is not in the set. Note that it still has to match a character, so it fails if there is nothing at that position in the input.
Searching for "at" preceded by a character other than c, b or m
❌ [^cbm]at -> at
❌ [^cbm]at -> cat
❌ [^cbm]at -> bat
❌ [^cbm]at -> mat
✔️ [^cbm]at -> rat
Searching for "at" not preceded by anything from c-m
✔️ [^c-m]at -> rat
Searching for "at" not preceded by a digit
✔️ [^\d]at -> cat
❌ [^\d]at -> 5at
✔️ [^\d]at -> 5 at
Searching for "cat" followed by a non-word character (the negated set still needs a character to match, so "cat" at the very end of the input fails)
❌ cat[^\w] -> cat
✔️ cat[^\w] -> the cat sat
❌ cat[^\w] -> caterpillar
❌ cat[^\w] -> vacate
❌ cat[^\w] -> copycat
Searching for "cat" not preceded or followed by an alphanumeric character
❌ [^\w]cat[^\w] -> copycat
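One detail of negated sets that is easy to miss is that they still consume a character. A negated set does not mean "nothing here"; it means "one character that is not in this set". The sketch below, with hypothetical function names, shows the effect using std::regex_search():

```cpp
#include <regex>
#include <string>

// Hypothetical helper: looks for "at" preceded by any
// single character other than c, b, or m
bool ContainsNonCbmAt(const std::string& Input) {
  static const std::regex Pattern{R"([^cbm]at)"};
  return std::regex_search(Input, Pattern);
}

// Hypothetical helper: looks for "cat" followed by any single
// character that is not a letter, digit, or underscore
bool ContainsCatThenNonWord(const std::string& Input) {
  static const std::regex Pattern{R"(cat[^\w])"};
  return std::regex_search(Input, Pattern);
}
```

ContainsNonCbmAt("rat") should return true, but ContainsNonCbmAt("at") should return false because there is no character before "at" for the set to match. Likewise, ContainsCatThenNonWord("cat") should return false because nothing follows "cat".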
We can look for repeating patterns within our input. We do that by adding syntax directly after the symbol or character set we want to look for repetitions of. We have several options:
*
The * character states there can be any number of the preceding symbol or character set. That can include zero:
✔️ ab*c -> ac
✔️ ab*c -> abc
✔️ ab*c -> abbc
✔️ ab*c -> abbbbbc
✔️ a.*c -> ac
✔️ a.*c -> abc
✔️ a.*c -> a123c
✔️ a.*c -> a123 abc
✔️ a[bcd]*e -> ae
✔️ a[bcd]*e -> abe
✔️ a[bcd]*e -> ace
✔️ a[bcd]*e -> abcde
✔️ a[bcd]*e -> abcdcdbe
❌ a[bcd]*e -> a1e
+
The + character specifies we want at least one of the preceding symbol or character set, but there can be more:
❌ ab+c -> ac
✔️ ab+c -> abc
✔️ ab+c -> abbc
✔️ ab+c -> abbbbbc
❌ a.+c -> ac
✔️ a.+c -> abc
✔️ a.+c -> a123c
✔️ a.+c -> a123 abc
❌ a[bcd]+e -> ae
✔️ a[bcd]+e -> abe
✔️ a[bcd]+e -> ace
✔️ a[bcd]+e -> abcde
✔️ a[bcd]+e -> abcdcdbe
❌ a[bcd]+e -> ale
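The only difference between * and + is whether zero repetitions are allowed. A minimal sketch of that difference, using hypothetical function names:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: "a", then zero or more "b", then "c"
bool MatchesZeroOrMore(const std::string& Input) {
  static const std::regex Pattern{"ab*c"};
  return std::regex_match(Input, Pattern);
}

// Hypothetical helper: "a", then one or more "b", then "c"
bool MatchesOneOrMore(const std::string& Input) {
  static const std::regex Pattern{"ab+c"};
  return std::regex_match(Input, Pattern);
}
```

The two functions should only disagree on the input ac, which MatchesZeroOrMore() accepts and MatchesOneOrMore() rejects.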
Exactly x repetitions: {x}
The brace syntax allows us to be more specific with how many repetitions we want. We can pass a single number between the braces, to specify we want a specific number of repetitions.
In the following example, we look for exactly two repetitions:
❌ ab{2}c -> ac
❌ ab{2}c -> abc
✔️ ab{2}c -> abbc
❌ ab{2}c -> abbbc
❌ a.{2}c -> ac
❌ a.{2}c -> abc
✔️ a.{2}c -> a12c
❌ a.{2}c -> a123c
❌ a[bcd]{2}e -> ae
❌ a[bcd]{2}e -> abe
❌ a[bcd]{2}e -> ace
✔️ a[bcd]{2}e -> abce
✔️ a[bcd]{2}e -> acde
❌ a[bcd]{2}e -> abcde
8-bit (one byte) hexadecimal value
✔️ [A-F0-9]{2} -> 6F
hexadecimal color
✔️ [A-F0-9]{6} -> 6F4AFF
Note that a common mistake comes from using this syntax alongside a substring search, like std::regex_search()
. In that context, a pattern like [0-9]{2}
which searches for exactly 2 digits will return true
on an input like 123
. Whilst the entire string of 123
has 3 digits, it has two substrings of 2 digits - 12
and 23
.
If we wanted this input to not match, we’d need to be more specific. For example, if we wanted our entire string to be exactly 2 digits, we could use std::regex_match()
, instead of std::regex_search()
, or add the start and end of input symbols ^
and $
to our regex.
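The three approaches described above can be compared side by side. This is a small sketch with hypothetical function names:

```cpp
#include <regex>
#include <string>

// Substring search: true if any two consecutive digits exist
bool ContainsTwoDigits(const std::string& Input) {
  static const std::regex Pattern{"[0-9]{2}"};
  return std::regex_search(Input, Pattern);
}

// Whole-string match: true only if Input is exactly two digits
bool IsTwoDigits(const std::string& Input) {
  static const std::regex Pattern{"[0-9]{2}"};
  return std::regex_match(Input, Pattern);
}

// Substring search with anchors: also requires exactly two digits
bool IsTwoDigitsAnchored(const std::string& Input) {
  static const std::regex Pattern{"^[0-9]{2}$"};
  return std::regex_search(Input, Pattern);
}
```

For the input 123, ContainsTwoDigits() should return true, while IsTwoDigits() and IsTwoDigitsAnchored() should both return false.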
At least x repetitions: {x,}
By adding a trailing comma within our braces, we specify that we want at least x
repetitions, without an upper limit. Below, we search for at least two repetitions:
❌ ab{2,}c -> ac
❌ ab{2,}c -> abc
✔️ ab{2,}c -> abbc
✔️ ab{2,}c -> abbbc
❌ a.{2,}c -> ac
❌ a.{2,}c -> abc
✔️ a.{2,}c -> a12c
✔️ a.{2,}c -> a123c
❌ a[bcd]{2,}e -> ae
❌ a[bcd]{2,}e -> abe
❌ a[bcd]{2,}e -> ace
✔️ a[bcd]{2,}e -> abce
✔️ a[bcd]{2,}e -> acde
✔️ a[bcd]{2,}e -> abcde
Between x and y repetitions: {x,y}
By adding a second number to our braces, we can specify both a lower and upper bound for the number of repetitions we are looking for. Below, we search for at least one, but not more than three repetitions of the preceding symbol or character set:
Looking for 1 to 3 repetitions
❌ ab{1,3}c -> ac
✔️ ab{1,3}c -> abc
✔️ ab{1,3}c -> abbc
✔️ ab{1,3}c -> abbbc
❌ ab{1,3}c -> abbbbc
❌ a.{1,3}c -> ac
✔️ a.{1,3}c -> abc
✔️ a.{1,3}c -> a12c
✔️ a.{1,3}c -> a123c
❌ a.{1,3}c -> a1234c
❌ a[bcd]{1,3}e -> ae
✔️ a[bcd]{1,3}e -> abe
✔️ a[bcd]{1,3}e -> ace
✔️ a[bcd]{1,3}e -> abce
✔️ a[bcd]{1,3}e -> acde
✔️ a[bcd]{1,3}e -> abcde
❌ a[bcd]{1,3}e -> abbcde
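As a sketch of the bounded form in C++, the hypothetical helper below checks the last pattern from the examples above against a whole input. The quantifier is written without a space after the comma, as in those examples:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: "a", then between one and three
// characters from the set b, c, d, then "e"
bool HasOneToThreeBcd(const std::string& Input) {
  static const std::regex Pattern{"a[bcd]{1,3}e"};
  return std::regex_match(Input, Pattern);
}
```

HasOneToThreeBcd("abcde") should return true because three characters sit between a and e, while HasOneToThreeBcd("ae") and HasOneToThreeBcd("abbcde") should return false because they have zero and four respectively.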
The repetition specifiers also work with the character set shortcuts \s
, \w
, and \d
:
❌ the\s+cat -> thecat
✔️ the\s+cat -> the cat
✔️ the\s+cat -> the   cat
✔️ c\w*t -> ct
✔️ c\w*t -> cat
✔️ c\w*t -> coat
✔️ c\w*t -> c5t
❌ c\w*t -> c-t
❌ \d{2} -> 1
✔️ \d{2} -> 10
Substring match
✔️ \d{2} -> 100
Full string match
❌ ^\d{2}$ -> 100
❌ \d{2,} -> 1
✔️ \d{2,} -> 10
✔️ \d{2,} -> 100
✔️ \d{1,3} -> 1
✔️ \d{1,3} -> 10
✔️ \d{1,3} -> 100
Substring match
✔️ \d{1,3} -> 1000
Full string match
❌ ^\d{1,3}$ -> 1000
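Combining shortcuts with repetition specifiers covers a lot of everyday checks. The sketch below uses two hypothetical helpers, one tolerating any amount of whitespace between two words and one validating a short number:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds "the" and "cat" separated by
// one or more whitespace characters of any kind
bool ContainsTheCat(const std::string& Input) {
  static const std::regex Pattern{R"(the\s+cat)"};
  return std::regex_search(Input, Pattern);
}

// Hypothetical helper: the whole input must be
// between one and three digits
bool IsUpToThreeDigits(const std::string& Input) {
  static const std::regex Pattern{R"(\d{1,3})"};
  return std::regex_match(Input, Pattern);
}
```

ContainsTheCat() should accept inputs where the words are separated by several spaces, tabs, or line breaks, but not thecat. IsUpToThreeDigits() uses std::regex_match(), so 1000 is rejected without needing ^ and $.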
In the next lesson, we’ll cover regular expression capture groups. These build on the regex syntax we learned but will allow us to go beyond just checking whether or not the text matches the regular expression pattern we created.
With capture groups, we will learn how to use regex to filter through and extract parts of our text, or to replace segments of it in a targeted way.
In this lesson, we've explored the fundamentals of using regular expressions in C++, covering how to create patterns, search and match strings, and apply modifiers for case-insensitivity. The main points covered were:
The <regex> header and std::regex for creating regular expression patterns in C++.
std::regex_match() and std::regex_search(), and examples of their basic usage.
Case-insensitive matching with std::regex::icase.
The wildcard (.), alternation (|), and optional (?) characters.
Anchors (^ for start of input, $ for end of input) to specify where a pattern should match.
Repetition specifiers (*, +, {x}, {x,}, and {x,y}) to match repeating patterns.
Character set shortcuts (\d, \w, \s) in regex, and how to negate character sets with ^.