We’ve been working with text and strings a lot so far, but we’ll go deeper in this lesson. We’ll gain a much more complete insight into how C++, and programming in general, deals with text.
First, we’ll discuss characters - the building blocks of text. We’ll introduce the complexities of character types, exploring the Unicode standard and different representations used to store characters, and how they impact memory and performance
We’ll also cover C-style strings, how they’re used, their pitfalls, and some useful functions that can help us work with them.
When it comes to text, the most fundamental data type is the individual character. Examples of characters include letters like H
and n
, numbers like 2
and 0
, and punctuation like <
and !
.
In C++, there are many different types we can use to store characters, and we’ll go over some of the options later in this lesson. For now, the most common built-in type we use for characters is the simple char
:
int main() {
char Letter;
}
A char
literal is created by enclosing the character in single quotes:
int main() {
char Letter { 'H' };
std::cout << Letter << 'i';
}
Hi
Note: in most cases, we should not use C-Style strings. More modern approaches such as std::string
are typically easier and safer to work with, and they come with more capabilities out of the box. However, C-style strings remain in common use, so a basic knowledge of what they are and how they work is extremely useful.
We can imagine strings as simply being a collection of characters.
A C-style string is an example of a null-terminated string. These work by storing each character in a contiguous block of memory, like an array. Rather than keeping track of the size of the array, they instead use a special character to denote where the string ends.
This character is stored in the block of memory immediately after the string, so as our code advances through the contiguous block of memory, it will eventually find this character, and know that the string has ended.
This character is called the null character or null terminator. The null character is an example of a control character or non-printable character. We cover control characters in more detail a little later.
The null character has no visual representation but, in code, diagrams, and writing, it is often denoted as a backslash followed by a zero: \0
Putting everything together, a null-terminated string containing the text "Hello" occupies 6 contiguous bytes in memory:
With this convention in place, we can then store strings simply as a pointer to the first character - a char*
.
Anything receiving such a pointer will know that the content of the string continues in memory until the null terminator is encountered.
Typically, we should consider these characters to be const
, so the data type of a C-style string is often declared as a pointer to a const char
:
int main() {
const char* MyString;
}
A c-style string literal is created by placing the characters between double quotes:
#include <iostream>
int main() {
const char* MyString{"Hello"};
std::cout << MyString << " World";
}
Hello World
Not all characters represent visible symbols. Some characters are intended to have some additional meaning which the receiver should interpret as a special instruction.
These are referred to as control characters. The new line character - \n
- which we’ve been using a lot is one such example. The null terminator - \0
- covered above is another, and there are many more.
\0
two characters?Whilst \0
is two characters in our source file, the way we represent a character when programming isn’t necessarily the same as how it is represented in the final compiled machine code.
Control characters typically have no visual appearance when rendered as text - they're often referred to as "non-printable characters". But, when we’re writing code, we still need to be able to insert them where appropriate, and, more importantly, we also want to be able to see where they have been inserted.
Therefore, we need to give them a visual appearance when they’re being used in the source code. \0
and similar syntax is just a standardized way to accomplish that task. Once our code is compiled, syntax such as this is represented in memory in the same way as any other character, and they become invisible to users as intended.
In C++, we insert control characters using a backslash - \
If we want a literal backslash in our string, we need to insert two backslashes: \\
#include <iostream>
int main(){
// This won't behave how we want
std::cout << "C:\Windows\notepad.exe";
std::cout << "\n\n";
// We need this instead
std::cout << "C:\\Windows\\notepad.exe";
std::cout << '\n';
// To insert \\ we need to escape both
std::cout << "\\\\host-name\\shared.txt";
}
C:Windows
otepad.exe
C:\Windows\notepad.exe
\\host-name\shared.txt
The need to escape backslashes can make some of our strings unnecessarily complex. To solve this problem, we have raw strings instead.
Raw string literals start with R"(
and end with ")
#include <iostream>
int main(){
std::cout << R"(no \n new \n lines \n here)";
std::cout << \n\n";
std::cout << R"(C:\Windows\notepad.exe)";
}
Raw strings treat backslashes as literal characters - everything between the delimiters R"(
and ")
becomes part of the string, including what would normally be escape sequences.
no \n new \n lines \n here
C:\Windows\notepad.exe
Raw strings are not a dedicated type - we can store their output in the usual ways, such as a C-style string or std::string
:
int main(){
std::string A {R"(no \n new \n lines)"};
const char* B {R"(C:\Windows\notepad.exe)"};
}
The primary char
type allocates 1 byte (8 bits) of memory. This is enough to store one of 256 different characters. However, there are more than 256 possible characters - particularly once we consider non-English alphabets.
Unicode is designed to standardize the handling of text across most of the world’s writing systems. The latest version of the standard identifies approximately 150,000 characters. This includes alphabetic characters and punctuation from 160 languages, symbols such as © and ♫, and emojis such as ❤️ and 🎉
The way we represent characters in computer memory (ie, bytes) is referred to as encoding.
There are several ways to encode Unicode characters. The most common international standard is UTF, which has three common variations:
UTF-8 is most commonly used, as the variable size (from one to four bytes) allows for some performance gains. This is because, whilst we need four bytes to capture all possible Unicode characters, many can be stored in fewer.
This is particularly true of the more commonly used characters. Unicode is designed such that all English / Latin alphabetic characters, punctuation, and numbers can be represented by a single byte.
As such, with UTF-8, a sequence like "Hello" only requires 5 bytes whilst, under UTF-32, it would require 20 bytes (5 characters, each requiring 4Â bytes).
An endorsement for UTF-8 that goes deeper into the technical details is available here.
C++ provides several built-in character types:
char
, which has 1 bytewchar_t
, which has either 2 or 4 bytes (depending on the compiler)char8_t
, char16_t
, and char32_t
specify exactly how many bits they should have, but these types are not widely used or supported.For larger projects that have localization requirements, we’ll generally be building dedicated infrastructure to support complex text requirements, or using a third-party library to handle it for us. In most other cases, using the basic char
type with UTF-8 encoding is a sensible default, unless we have a compelling reason to do otherwise.
We still need to be aware that alternative options exist though, as we may be required to use them. For example, software in China follows a different standard.
The libraries and APIs we’re interacting with can also follow different standards. The Windows API uses wide characters, encoded with UTF-16. If we’re heavily using that API, that may convince us to adopt the same convention or force us to handle conversions at the interface points.
We can store Unicode characters in our regular char
 type.
However, because char
only stores 8 bits, we will need multiple char
s to contain more exotic Unicode characters, such as emojis. So, even though we have a single character, under UTF-8 we may need to store it as a string:
#include <iostream>
int main(){
// This would be an error - narrowing conversion
// char Heart{'❤'};
// Emoji requires multiple bytes, so use a string
const char* Heart{"❤"};
std::cout << "Hello World " << Heart;
}
Hello World ❤
There are some platform-specific considerations here. Depending on your operating system and compiler, additional steps may be required to properly handle Unicode characters.
For example, on Windows, we may need to update our project properties to set our source and execution character set, following the instructions available here.
Within our code, we may also need to set the console output "code page". We can do this by including windows.h
and then calling SetConsoleOutputCP
with the argument CP_UTF8
:
#include <windows.h>
#include <iostream>
int main(){
SetConsoleOutputCP(CP_UTF8);
const char* Heart{"❤"};
std::cout << "Hello Windows " << Heart;
}
Hello Windows ❤
In this lesson, we explored handling characters, strings, and encoding in C++, including the use of C-style strings, Unicode, and different character types.
char
being the most commonly used data type for their representation.\n
for new-line and \0
for null terminator, play special roles in text processing.char
, wchar_t
, char16_t
, char32_t
, and char8_t
(since C++20), to support different encoding needs and language characters.char
values or the use of wider character types, especially for characters beyond the basic ASCII range.An introduction to C++ character types, the Unicode standard, character encoding, and C-style strings
Comprehensive course covering advanced concepts, and how to use them on large-scale projects.