An introduction to regular expression capture groups, and how to use them in C++ with regex_search, regex_replace, regex_iterator, and regex_token_iterator
In this lesson, we’ll build on our regex knowledge by introducing capture groups, non-capture groups, and lazy quantifiers. Our previous lesson gave a broad introduction to regex, and showed how to use it to determine if a string matches our criteria.
Here, we will focus on a more advanced use case, where we want to extract information from the strings we receive.
Use cases for this include:
Finally, we’ll see how we can make use of these regex concepts in C++.
This is intended to be a follow-up to our introductory lesson on Regular Expressions in C++, so familiarity with the concepts covered there is recommended.
In regular expressions, capture groups are defined by parentheses: ( and ). For example, if we wanted to capture all of the times "hello" appears in our string, our regex would look like this:
(hello)
If we wanted to capture only the "hello"s that precede " world", it would look like this:
(hello) world
Within the capture group, we still have all of our usual regex powers, for example:
(hello|goodbye) - Capture hello or goodbye
([hc]ello) - Capture hello and cello
([hc]ello?) - Capture hell, hello, cell and cello
(hel*o) - Capture heo, helo, hello, etc
(hel\w) - Capture hell, help, etc
hello(.*)world - Capture everything between hello and world
Just like the order of operations in maths and programming can be manipulated using brackets, so too can the order of operations within regex.
Operators like ? and * can be applied to a capture group, while | within a capture group constrains its effect to that group.
✔️ The( big)? cat -> The cat
✔️ The( big)? cat -> The big cat
❌ The( big)? cat -> The big big cat
✔️ The( big)* cat -> The cat
✔️ The( big)* cat -> The big cat
✔️ The( big)* cat -> The big big cat
✔️ The (big|chonky) cat -> The big cat
✔️ The (big|chonky) cat -> The chonky cat
❌ The (big|chonky) cat -> The big chonky cat
❌ The( big| chonky)+ cat -> The cat
✔️ The( big| chonky)+ cat -> The big cat
✔️ The( big| chonky)+ cat -> The big chonky cat
✔️ The( big| chonky)+ cat -> The big chonky big cat
Note, if we just want to manipulate the order of operations, and have no need to capture the content of our group, we can use a non-capture group instead. These are introduced a little later in this lesson.
When we want to look for literal ( and ) in our strings rather than creating a capture group, we can escape them in the usual way, using \.
For example, if we want to search a string for the sequence "The (big) cat", our regex would be "The \(big\) cat"
Creating a capture group:
❌ The (big) cat -> The (big) cat
Searching for a pattern containing brackets
✔️ The \(big\) cat -> The (big) cat
Let's imagine we want to get a breakdown of what email providers our users have signed up with. We have a list of emails, like example@gmail.com, and we want to generate a list of email providers, like gmail.
Our first attempt at a regular expression might be to capture everything from the @ to the literal .:
@(.*)\.
Some basic testing would indicate this works - given a string like example@gmail.com, the substring of gmail is captured as intended.
However, given the string example@gmail.co.uk, the substring of gmail.co is captured.
This is because, by default, repetition quantifiers such as *, +, and {x,y} are greedy. They capture as much as possible.
In this example, the .* in (.*)\. is capturing everything until the last period in the string, not the next period.
We can change this by appending a ? after the quantifier, thereby making it lazy:
@(.*?)\.
This also applies to the other quantifiers. The following lists what would be captured by various regular expressions, given the string of 54321!.
We apply different quantifiers to the \d character, to remove some of the digits from what is ultimately captured:
* matches as many repetitions as possible
✔️ \d*(.*) => 54321! = !
*? matches as few repetitions as possible
✔️ \d*?(.*) => 54321! = 54321!
+ matches as many repetitions as possible, but at least 1
✔️ \d+(.*) => 54321! = !
+? matches as few repetitions as possible, but at least 1
✔️ \d+?(.*) => 54321! = 4321!
{2,4} matches 2, 3 or 4 repetitions, preferring more
✔️ \d{2,4}(.*) => 54321! = 1!
{2,4}? matches 2, 3 or 4 repetitions, preferring fewer
✔️ \d{2,4}?(.*) => 54321! = 321!
The ability to create groups of tokens within our regular expression is generally useful, even if we don’t need to capture them. For these, we have non-capture groups. Non-capture groups start with (?: and end with ).
They have many use cases, including the following:
Applying the optional quantifier (?) to sub-patterns
Below, we make the string "brown " optional:
✔️ The (?:brown )?fox -> The fox
✔️ The (?:brown )?fox -> The brown fox
❌ The (?:brown )?fox -> The red fox
Applying repetition quantifiers (*, +, and {}) to sub-patterns
Below, we allow the string "red " to appear 0-2 times:
✔️ The (?:red ){0,2}fox -> The fox
✔️ The (?:red ){0,2}fox -> The red fox
✔️ The (?:red ){0,2}fox -> The red red fox
❌ The (?:red ){0,2}fox -> The red red red fox
The following example shows how we can use a non-capturing group to manipulate which part of the pattern the alternation operator | applies to:
Default order of operators
✔️ The brown|red fox -> The brown
✔️ The brown|red fox -> red fox
Controlling it using a non-capture group
❌ The (?:brown|red) fox -> The brown
❌ The (?:brown|red) fox -> red fox
✔️ The (?:brown|red) fox -> The brown fox
✔️ The (?:brown|red) fox -> The red fox
The rest of this lesson will focus on how we can use capture groups within the C++ standard library’s regex helpers, available by including <regex>.
std::match_results and std::smatch
The std::match_results<> template class is typically how we want to store the output of our regex operations. This is templated so it can be used with different types of strings.
However, an instance of the template class that works with std::string has already been aliased for us. It is called std::smatch, which is what we'll be using here.
The std::regex_search() function has an overload that accepts a std::match_results object as the second argument:
#include <regex>
#include <string>

int main() {
  std::string Input{"Hello There"};
  std::regex Pattern{"Hello There"};
  std::smatch Match;
  std::regex_search(Input, Match, Pattern);
}
We covered the basics of std::regex_search() in our introductory lesson. Here, we’ll focus on its interaction with std::match_results and capture groups.
Each object in the match results is a std::sub_match. These objects contain some useful information about the sub-match that was found, as well as the matched string, which can be accessed using the str() method.
If the std::regex_search() call was successful, the std::smatch will contain at least one sub-match: the substring that matched the entire regex pattern we provided:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{"Hello There"};
  std::regex Pattern{"Hello There"};
  std::smatch Match;
  if (std::regex_search(Input, Match, Pattern)) {
    std::cout << Match.size()
              << " sub-match found!";
    // Index 0 holds the overall match
    for (auto Submatch : Match) {
      std::cout << "\nSubmatch: " << Submatch;
    }
  }
}
1 sub-match found!
Submatch: Hello There
When we’re not using capture groups, our std::match_results container will only contain one std::sub_match. However, when our regex contains capture groups, what was captured by those capture groups will be included in the std::smatch collection.
The overall match will be at index 0, what was captured by the first capture group will be at index 1, the second group at index 2, and so on.
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{"Hi All"};
  std::regex Pattern{"(Hello|Hi) (There|All)"};
  std::smatch Match;
  if (std::regex_search(Input, Match, Pattern)) {
    std::cout << Match.size()
              << " submatches found!";
    // Index 0 is the overall match, 1 and 2 are the groups
    for (auto Submatch : Match) {
      std::cout << "\nSubmatch: " << Submatch;
    }
  }
}
3 submatches found!
Submatch: Hi All
Submatch: Hi
Submatch: All
The std::smatch has some additional properties and methods we may find useful. For example, the position() method accepts an integer parameter and will return the starting position of the corresponding sub-match within the input string.
Additionally, the std::sub_match objects have methods and member variables we may find useful, including:
length() - the length of the sub-match string
first - an iterator to the first character in the sub-match
second - an iterator to the position just after the last character in the sub-match
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{"Hello World"};
  std::regex Pattern{"Hello (.*)"};
  std::smatch Matches;
  if (std::regex_search(Input, Matches, Pattern)) {
    std::cout << Matches.size()
              << " submatches found!";
    for (std::size_t i{0}; i < Matches.size(); ++i) {
      std::cout << "\n\nSubmatch " << i << ": "
                << Matches[i] << "\n Length: "
                << Matches[i].length()
                << "\n First Character: "
                << *Matches[i].first
                << "\n Position: "
                << Matches.position(i);
    }
  }
}
2 submatches found!
Submatch 0: Hello World
Length: 11
First Character: H
Position: 0
Submatch 1: World
Length: 5
First Character: W
Position: 6
std::regex_iterator
Calls to std::regex_search() will stop once a match is found. However, our pattern may match at several places in our input string.
When we want to match all instances of a pattern within our string, we have some other options we can use. We could do it with multiple calls to std::regex_search(), but it’s typically easier and safer to use the standard library's dedicated regex iterators instead.
The /g flag
In other programming languages, this style of regex matching is often called "global search". Typically, we’d activate it by appending a g flag to the end of the regex pattern, such as /pattern/g in JavaScript, and just using the same function - such as that language’s equivalent to std::regex_search().
When using the C++ standard library, we don’t have that option. We need to write a bit more code to implement global search. However, the trade-off is that we have full control over how it behaves.
The std::regex_iterator is also a template class, but an alias has been provided if we're working with std::string objects. The alias is std::sregex_iterator.
We construct the starting iterator by passing a std::string iterator pair as the first two arguments, representing where we want the search to begin, and where we want it to end. Typically, we want to search the entire string, so we just pass the results of the begin() and end() methods of our input string.
The third argument we need to pass to the constructor is our regex pattern:
std::string Input{
  "Hello World, Goodbye World"};
std::regex Pattern{
  "(Hello|Goodbye) (World|Everyone)"};
std::sregex_iterator Iterator{
  Input.begin(), Input.end(), Pattern};
To create an end iterator to compare against, we can create a second std::sregex_iterator, passing no arguments.
With this setup, we can now use the std::sregex_iterator to iterate through all the matches found in our input string.
Similar to calls to std::regex_search(), each iteration will yield a std::match_results object. We can then access the std::sub_match objects within each container in the usual way:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "Hello World, Goodbye World"};
  std::regex Pattern{
    "(Hello|Goodbye) (World|Everyone)"};
  std::sregex_iterator Iterator{
    Input.begin(), Input.end(), Pattern};
  // A default-constructed iterator marks the end
  std::sregex_iterator End;
  while (Iterator != End) {
    std::cout << "Match";
    // *Iterator is a std::smatch for the current match
    for (auto Match : *Iterator) {
      std::cout << "\n Submatch: " << Match;
    }
    std::cout << "\n\n";
    ++Iterator;
  }
}
Match
Submatch: Hello World
Submatch: Hello
Submatch: World
Match
Submatch: Goodbye World
Submatch: Goodbye
Submatch: World
std::regex_token_iterator
We have an alternative regex iterator we can use - the std::regex_token_iterator. A std::string version is available as std::sregex_token_iterator.
We construct and iterate over it in the same way we did std::sregex_iterator, but the matches are provided in a simpler form.
The token iterator skips the intermediate std::match_results containers - it instead just iterates directly through the std::sub_match objects.
By default, it provides us with the sub-match at index 0 of each match. That is, it gives us the sub-matches that matched the entire regex pattern, rather than any specific capture group:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "Hello World, Goodbye Everyone"};
  std::regex Pattern{
    "(Hello|Goodbye) (World|Everyone)"};
  std::sregex_token_iterator Iterator{
    Input.begin(), Input.end(), Pattern};
  std::sregex_token_iterator End;
  while (Iterator != End) {
    // *Iterator is a std::sub_match, not a std::smatch
    std::cout << "\nSubmatch: " << (*Iterator);
    ++Iterator;
  }
}
Submatch: Hello World
Submatch: Goodbye Everyone
By passing a 4th argument to the std::sregex_token_iterator, we can specify which sub-match we want. Below, we specify index 1, i.e., the sub-match that was captured by our first capture group:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "Hello World, Goodbye Everyone"};
  std::regex Pattern{
    "(Hello|Goodbye) (World|Everyone)"};
  // The 4th argument selects the first capture group
  std::sregex_token_iterator Iterator{
    Input.begin(), Input.end(), Pattern, 1};
  std::sregex_token_iterator End;
  while (Iterator != End) {
    std::cout << "\nSubmatch: " << (*Iterator);
    ++Iterator;
  }
}
Submatch: Hello
Submatch: Goodbye
We can pass multiple indices to the 4th argument, using a std::vector, a C-style array, or an initializer list.
Below, we specify we want all the sub-matches of our regex. We know this pattern has two capture groups, so we expect three sub-matches per match. We want the overall match 0, and the capture groups 1 and 2:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "Hello World, Goodbye Everyone"};
  std::regex Pattern{
    "(Hello|Goodbye) (World|Everyone)"};
  // Request the overall match and both capture groups
  std::sregex_token_iterator Iterator{
    Input.begin(),
    Input.end(),
    Pattern,
    {0, 1, 2}};
  std::sregex_token_iterator End;
  while (Iterator != End) {
    std::cout << "\nSubmatch: " << (*Iterator);
    ++Iterator;
  }
}
Submatch: Hello World
Submatch: Hello
Submatch: World
Submatch: Goodbye Everyone
Submatch: Goodbye
Submatch: Everyone
If we have a std::regex object and need to programmatically find out how many capture groups it contains, we can use the mark_count() method.
#include <regex>
#include <iostream>

int main() {
  std::regex Pattern{
    "(Hello|Goodbye) (World|Everyone)"};
  std::cout << Pattern.mark_count();
}
2
This function is named mark_count() as an alternative name for a capture group is a "marked subexpression".
std::regex_replace()
The std::regex_replace() function allows us to make changes to a string, based on a regular expression. In the following example, we replace every instance of "World" with the string "Everyone":
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "Hello World, Goodbye World"};
  std::regex Search{"World"};
  std::string Replace{"Everyone"};
  // The input is not modified; a new string is returned
  std::string Updated{std::regex_replace(
    Input, Search, Replace)};
  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: Hello World, Goodbye World
After: Hello Everyone, Goodbye Everyone
Below, we use a slightly more complicated regex pattern to replace anything that looks somewhat like an email address:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "email me at bob@gmail.com or "
    "bob@yahoo.com"};
  std::regex Search{R"(\w*@[\w.]*)"};
  std::string Replace{"[redacted]"};
  std::string Updated{std::regex_replace(
    Input, Search, Replace)};
  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: email me at bob@gmail.com or bob@yahoo.com
After: email me at [redacted] or [redacted]
Note: A robust regex pattern for email addresses is significantly more complicated than this. The patterns used throughout this lesson have been simplified for clarity.
std::regex_replace() with Capture Groups
When using std::regex_replace() with capture groups, we can include what was captured within our replacement string. We do this using the $ symbol, followed by the number of our capture group within our regex pattern, starting from 1. For example, $1, $2, $3, and so on.
Below, we change how negative numbers are displayed. For example, -100 becomes (100).
We do this by adding a capture group to our regex, and then referencing what was captured by that group using $1 in our replacement string:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "The balances are 400, -100 and 250"};
  // \d+ requires at least one digit, so a lone "-" is ignored
  std::regex Search{R"(-(\d+))"};
  std::string Replace{"($1)"};
  std::string Updated{std::regex_replace(
    Input, Search, Replace)};
  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: The balances are 400, -100 and 250
After: The balances are 400, (100) and 250
In this example, we use multiple capture groups to reorder and duplicate parts of our string:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{"The name's James Bond"};
  std::regex Search{"(The name's) (.*) (.*)"};
  // $3 is used twice, so the surname appears twice
  std::string Replace{"$1 $3, $2 $3"};
  std::string Updated{std::regex_replace(
    Input, Search, Replace)};
  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: The name's James Bond
After: The name's Bond, James Bond
If we want our replacement string to include a literal dollar value, like "I can pay $3", we escape the capture group reference using an additional dollar sign. For example, to have $3 in our replacement string, we’d use $$3:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{"The price is {price}"};
  std::regex Search{R"(\{price\})"};
  // $$ produces a single literal $ in the output
  std::string Replace{"$$3.50"};
  std::cout << std::regex_replace(Input, Search,
                                  Replace);
}
The price is $3.50
We can access the full substring that was matched using the $& token. This is equivalent to the contents at index 0 of a std::match_results object:
#include <regex>
#include <iostream>
#include <string>

int main() {
  std::string Input{
    "The hungry brown cat and the sleepy "
    "black bear"};
  std::regex Search{
    ".*?(happy|hungry|sleepy) (brown|black) "
    "(bear|fox|cat).*?"};
  // $& is the overall match, $1 to $3 are the groups
  std::string Replace{
    "Matched: $& \n Animal: $3\n Color: "
    "$2\n Mood: $1\n\n"};
  std::cout << std::regex_replace(Input, Search,
                                  Replace);
}
Matched: The hungry brown cat
Animal: cat
Color: brown
Mood: hungry
Matched: and the sleepy black bear
Animal: bear
Color: black
Mood: sleepy
In this lesson, we explored regex capture groups and their application within C++, enabling us to manipulate and extract specific parts of strings for both analysis and transformation.
The role of std::smatch and std::match_results in storing and accessing matched results from regex operations.
Using std::regex_search(), std::regex_replace(), std::regex_iterator, and std::sregex_token_iterator to perform complex regex operations in C++.
Referencing captured groups in replacement strings with std::regex_replace().
Counting the capture groups in a pattern using mark_count().