How do I handle Unicode normalization in C++?

Question

Ryan McCombe · Accepted Answer

Unicode normalization is the process of converting strings to a standard form, ensuring that equivalent strings have a unique binary representation.
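To see what "equivalent strings" means in practice, here is a minimal sketch using only the standard library. It stores the UTF-8 bytes of "é" in its two canonically equivalent spellings and compares them. The names `kComposed`, `kDecomposed`, `sameBytes`, and `report` are illustrative, and the byte sequences are written out by hand rather than produced by any normalizer.

```cpp
#include <iostream>
#include <string>

// "é" as one precomposed code point, U+00E9, encoded in UTF-8.
const std::string kComposed = "\xC3\xA9";

// "é" as 'e' (U+0065) followed by the combining acute accent (U+0301),
// encoded in UTF-8.
const std::string kDecomposed = "e\xCC\x81";

// Byte-wise equality, which is all std::string's operator== checks.
bool sameBytes(const std::string& a, const std::string& b) {
  return a == b;
}

// Prints the size of each spelling and whether the bytes match.
void report() {
  std::cout << "composed:   " << kComposed.size() << " bytes\n"
            << "decomposed: " << kDecomposed.size() << " bytes\n"
            << "same bytes: "
            << (sameBytes(kComposed, kDecomposed) ? "yes" : "no") << "\n";
}
```

Calling `report()` from `main` shows that the two spellings differ in both length and content, even though they render as the same character. Normalization maps both spellings to a single byte sequence.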

Normalization is crucial for string comparison and searching, because canonically equivalent strings otherwise compare as unequal byte by byte. In C++, handling it typically involves a third-party library, as the standard library provides no normalization support. Here's how you can approach it:

Using the ICU Library

The International Components for Unicode (ICU) library is a comprehensive Unicode library that provides normalization functions. Here's an example of how to use ICU for Unicode normalization:

#include <iostream>
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Convert an ICU string to UTF-8 for printing
std::string toUTF8(const icu::UnicodeString& str) {
  std::string result;
  str.toUTF8String(result);
  return result;
}

int main() {
  UErrorCode status = U_ZERO_ERROR;

  // Shared NFC normalizer; ICU owns it, so do not delete it
  const icu::Normalizer2* nfc =
      icu::Normalizer2::getNFCInstance(status);
  if (U_FAILURE(status)) {
    std::cerr << "ICU error: " << u_errorName(status) << "\n";
    return 1;
  }

  // Two representations of the same character
  // 'e' followed by combining acute accent
  icu::UnicodeString str1(u"e\u0301");

  // 'é' as a single precomposed character
  icu::UnicodeString str2(u"\u00E9");

  std::cout << "Before normalization:\n";
  std::cout << "str1: " << toUTF8(str1) << "\n";
  std::cout << "str2: " << toUTF8(str2) << "\n";
  std::cout << "Equal: "
    << (str1 == str2 ? "Yes" : "No")  << "\n\n";

  // Normalize to NFC (Normalization Form
  // Canonical Composition)
  icu::UnicodeString nfc1 = nfc->normalize(str1, status);
  icu::UnicodeString nfc2 = nfc->normalize(str2, status);
  if (U_FAILURE(status)) {
    std::cerr << "ICU error: " << u_errorName(status) << "\n";
    return 1;
  }

  std::cout << "After NFC normalization:\n";
  std::cout << "str1: " << toUTF8(nfc1) << "\n";
  std::cout << "str2: " << toUTF8(nfc2) << "\n";
  std::cout << "Equal: "
    << (nfc1 == nfc2 ? "Yes" : "No") << "\n";
}

Before normalization:
str1: é
str2: é
Equal: No

After NFC normalization:
str1: é
str2: é
Equal: Yes

Normalization Forms

Unicode defines four normalization forms:

NFD (Normalization Form Canonical Decomposition): Characters are decomposed by canonical equivalence, and combining marks are put in a canonical order.
NFC (Normalization Form Canonical Composition): Characters are decomposed and then recomposed by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition): Characters are decomposed by compatibility equivalence.
NFKC (Normalization Form Compatibility Composition): Characters are decomposed by compatibility equivalence, then recomposed by canonical equivalence.

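NFC is often the most practical for general use, as it provides a unique representation for canonically equivalent strings while keeping characters composed where possible. The compatibility forms (NFKD and NFKC) go further and also fold formatting variants such as ligatures and superscripts into their plain equivalents, which can discard distinctions you may want to keep.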
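To make the four forms concrete, the sketch below lists what each one produces for the two-character string "ﬁé" (U+FB01 LATIN SMALL LIGATURE FI followed by U+00E9). It performs no normalization itself: the code point sequences are written out by hand from the characters' Unicode decomposition mappings. The names `toCodePoints`, `FormExample`, `kForms`, and `printForms` are illustrative and not part of ICU.

```cpp
#include <cstdio>
#include <iostream>
#include <string>

// Formats each code point as U+XXXX so that strings which look the same
// on screen can be told apart.
std::string toCodePoints(const std::u32string& text) {
  std::string out;
  for (char32_t cp : text) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04X", static_cast<unsigned>(cp));
    if (!out.empty()) out += ' ';
    out += buf;
  }
  return out;
}

struct FormExample {
  const char* form;
  std::u32string text;
};

// Hand-written results for the input U+FB01 U+00E9; not computed here.
const FormExample kForms[] = {
    {"NFD",  {0xFB01, 0x0065, 0x0301}},          // ligature kept, é split
    {"NFC",  {0xFB01, 0x00E9}},                  // ligature kept, é composed
    {"NFKD", {0x0066, 0x0069, 0x0065, 0x0301}},  // ligature -> f i, é split
    {"NFKC", {0x0066, 0x0069, 0x00E9}},          // ligature -> f i, é composed
};

// Prints one line per normalization form.
void printForms() {
  for (const FormExample& example : kForms) {
    std::cout << example.form << ": "
              << toCodePoints(example.text) << "\n";
  }
}
```

Calling `printForms()` from `main` lists the sequences side by side. Only the compatibility forms replace the ligature with a plain "f" and "i", while the canonical forms leave it untouched.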

Considerations

Performance: Normalization can be computationally expensive. Consider caching normalized strings if you're working with large amounts of text.
Storage: Normalized strings may require more or less storage than the original, depending on the normalization form and the original text.
Compatibility: Ensure all parts of your system use the same normalization form to avoid inconsistencies.
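The caching idea from the performance point could be sketched as follows. `NormalizationCache` is a hypothetical helper, not an ICU class. It takes the normalization function as a parameter, so in real code you would pass a callable that wraps your ICU call. The cache here is unbounded and not thread-safe, which is an assumption made to keep the example short.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Memoizes the result of an expensive normalization function.
class NormalizationCache {
 public:
  using Normalizer = std::function<std::string(const std::string&)>;

  explicit NormalizationCache(Normalizer normalizer)
      : normalizer_(std::move(normalizer)) {}

  // Returns the cached result if present; otherwise normalizes once
  // and stores the result for later lookups.
  const std::string& get(const std::string& input) {
    auto it = cache_.find(input);
    if (it == cache_.end()) {
      it = cache_.emplace(input, normalizer_(input)).first;
    }
    return it->second;
  }

  // Number of distinct inputs normalized so far.
  std::size_t size() const { return cache_.size(); }

 private:
  Normalizer normalizer_;
  std::unordered_map<std::string, std::string> cache_;
};
```

Constructing one cache per normalization form also helps with the compatibility point, since every lookup through the same cache is guaranteed to use the same form.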

Remember, while normalization is crucial for correct string comparisons, it's not a silver bullet for all Unicode-related issues. You'll still need to consider other aspects of internationalization and localization in your C++ programs.
