So far, our programs have been constructed as a single stream of instructions that run sequentially. This form of execution is the easiest to understand, but rarely the most performant.
In most environments where our code will be executed, the hardware is capable of running multiple streams of instructions concurrently - for example, across the multiple cores of a modern CPU.
This capability is sometimes referred to as concurrency or parallelism, and our programs can be orders of magnitude faster if we design them to take advantage of it.
In programming, a stream of execution is often called a thread. So far, our programs have been single-threaded - the thread starts with our main
function, and instructions are executed in order, one at a time, until our program ends.
To use concurrency, we can look for opportunities to break our program into multiple threads. For example, we might decide to create a separate thread to handle audio.
The user’s operating system can then run that thread on a separate core from our main thread. And, now that the audio logic has been removed from the main thread, it has less work to do, so it can update and respond to user input faster.
It’s worth noting that a thread of execution doesn’t necessarily correspond to a core on the user’s system. After all, our programs will generally be run across a range of hardware with different core counts, so we can’t design our code on that basis anyway.
By breaking our program into threads, we’re providing the operating system with more flexibility to run our program in a way that makes better use of the available hardware.
We have a dedicated chapter on multi-threading later in the course, which covers creating and managing threads directly. This lesson is focused on multi-threading within existing standard library algorithms. Many of these algorithms can implement multi-threading for us, using execution policies.
By default, the algorithms we use execute sequentially. We demonstrate this below using the for_each()
algorithm as we’ve done before.
Note that we're using the iterator version of for_each()
rather than the range-based variation. As of C++23, range-based algorithms do not yet support execution policies, though that is likely to change in future standard versions.
We’ve simulated a slow-running function by including a one-second pause within the function body:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

void Log(int Number) {
  using namespace std::chrono_literals;
  std::this_thread::sleep_for(1s);
  std::cout << "Number: " << Number << '\n';
}

int main() {
  std::vector Numbers{1, 2, 3};
  std::for_each(Numbers.begin(), Numbers.end(), Log);
}
Running our program, we'll see the calls execute in sequence. The program takes approximately three seconds to complete, with each function call finishing about a second apart:
Number: 1
Number: 2
Number: 3
To enable our program to run in parallel, we provide an execution policy as the first argument. Execution policies are available within the <execution>
header, and the parallel policy is available as std::execution::par
:
#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <thread>
#include <vector>

void Log(int Number) {
  using namespace std::chrono_literals;
  std::this_thread::sleep_for(1s);
  // There is now a bug here - see explanation below
  std::cout << "Number: " << Number << '\n';
}

int main() {
  std::vector Numbers{1, 2, 3};
  std::for_each(std::execution::par,
    Numbers.begin(), Numbers.end(), Log);
}
Now, all our function calls should run in parallel. In most environments, this means our program should complete in around one second, with all our functions logging to the console at around the same time:
Number: 3
Number: 1
Number: 2
When we’re designing our programs to be multi-threaded, there are several important implications we have to keep in mind.
The first is perhaps obvious from the output of the previous program: we don't know the order in which threads will complete. This means that every time we run our program, we could get a different output.
We might get:
Number: 1
Number: 3
Number: 2
Or:
Number: 2
Number: 3
Number: 1
Or any other ordering.
It may seem confusing that the same instructions, running on the same hardware, can have a non-deterministic execution order and timing.
However, typically our program is not the only process that is running on a machine. Operating systems manage many programs and background tasks running at the same time.
Hardware resources (such as CPU cores) need to be allocated to our threads so they can be executed.
Those resources are not always allocated immediately, and our threads are not always allocated resources at the same time.
This process of allocating resources is typically called scheduling, and the component of the operating system that performs it is called the scheduler.
Another implication is that, when threads running concurrently are accessing or operating on the same resources, we don’t know the order that this will happen.
In the previous code example, we highlighted a seemingly innocuous expression as having a bug:
std::cout << "Number: " << Number << '\n';
Specifically, if we run our program enough times, we’ll eventually get an output that looks something like this:
Number: Number: 1
Number: 3
2
This is because we have multiple threads manipulating a shared resource - std::cout
. Each thread is performing a sequence of three operations on std::cout
- note the three <<
 operators.
When writing sequential programs, this is fine: we know each function call will complete its three operations before the next function call starts.
But that guarantee disappears once our code can run in parallel. Code that remains correct in that situation is called "thread-safe".
Our code is not currently thread-safe, because another thread can start writing to std::cout
in the middle of our sequence of three <<
operations.
To fix this, we need to make our process atomic.
"Atomic" is derived from the Greek word "atomos", meaning "indivisible" or "uncuttable".
In the context of multi-threaded programming, "atomic" means that code running in other threads cannot "divide" or "cut into" our code.
Our previous attempt was not atomic because it was divided into three operations, allowing other threads to cut in between them.
On the other hand, an atomic process appears to happen instantaneously.
As a result, other threads cannot cut into it, because there is no opportunity to do so. From their perspective, at any given point in time, an atomic process has either not started yet or has fully completed.
The previous simple example can be made atomic by formatting the desired output within the local thread, and then sending the complete string to std::cout
in a single operation:
#include <algorithm>
#include <chrono>
#include <execution>
#include <format>
#include <iostream>
#include <thread>
#include <vector>

void Log(int Number) {
  using namespace std::chrono_literals;
  std::this_thread::sleep_for(1s);
  std::cout << std::format("Number: {}\n", Number);
}

int main() {
  std::vector Numbers{1, 2, 3};
  std::for_each(std::execution::par,
    Numbers.begin(), Numbers.end(), Log);
}
Number: 3
Number: 2
Number: 1
This example uses std::format()
, which we covered in a dedicated lesson.
We’ll cover the full suite of tools to manage shared resources in multi-threaded contexts in a dedicated chapter later in the course.
A bug where our program exhibits undesirable behavior based on the order of task execution is commonly called a race condition.
For example, if our program runs Task A
and Task B
concurrently, and our code is written based on the assumption that Task A
always finishes first, we have introduced a race condition.
These bugs can be particularly insidious because, by nature, they don’t always occur. For example, our assumption might be correct for 99% of executions, meaning the bug only happens 1% of the time.
This can make it very difficult to diagnose, or be missed entirely until our software is released and running at scale.
std::thread::id
The main way we can understand how our threads are being scheduled is through our debugger. The exact process for doing this depends on what IDE we’re using.
A guide to multithreaded debugging in Visual Studio is available on the Microsoft site.
Alternatively, to determine which thread a function is running on, we can call std::this_thread::get_id().
This returns a std::thread::id
- a lightweight identifier we can use to distinguish threads:
#include <algorithm>
#include <chrono>
#include <execution>
#include <format>
#include <iostream>
#include <thread>
#include <vector>

void Log(int Number) {
  using namespace std::chrono_literals;
  std::this_thread::sleep_for(1s);
  std::cout << std::format(
    "Thread: {}, Number: {}\n",
    std::this_thread::get_id(), Number);
}

int main() {
  std::vector Numbers{1, 2, 3};
  std::cout << "Main Thread: "
    << std::this_thread::get_id() << '\n';
  std::for_each(std::execution::par,
    Numbers.begin(), Numbers.end(), Log);
}
Running this code, we see that the first execution of Log
was run on the same thread as our main
function, whilst the other two executions each got their own thread:
Main Thread: 19668
Thread: 19668, Number: 1
Thread: 57160, Number: 3
Thread: 51188, Number: 2
By specifying a parallel execution policy, we are stating that our code is written such that it can be parallelized.
It is then up to the standard library implementation and the operating system to decide what to do with that information - it doesn't necessarily mean our code will always be parallelized.
For example, if we make our function run faster, by removing the one-second sleep, we’re likely to see fewer threads, or possibly even a sequential execution:
#include <algorithm>
#include <execution>
#include <format>
#include <iostream>
#include <thread>
#include <vector>

void Log(int Number) {
  // using namespace std::chrono_literals;
  // std::this_thread::sleep_for(1s);
  std::cout << std::format(
    "Thread: {}, Number: {}\n",
    std::this_thread::get_id(), Number);
}

int main() { /*...*/ }
Main Thread: 31108
Thread: 31108, Number: 1
Thread: 31108, Number: 2
Thread: 31108, Number: 3
Even though we stated our algorithm could be run in parallel, the platform decided it would be more performant to just run it sequentially.
Multithreading requires additional work - creating threads and communicating between them carries overhead. If our function calls are fast, the platform may decide that simply running our algorithm sequentially is more performant.
In this lesson, we explored how to harness the power of parallelism in C++ by utilizing execution policies with standard library algorithms. This can enhance program performance through concurrent execution.
We delved into the subtleties of multithreaded programming, demonstrating how to write thread-safe code and the implications of parallel execution on program behavior.
Key takeaways include enabling parallel execution by passing std::execution::par
to standard library algorithms, and using std::format
for safer string formatting in a multithreaded context.