Characters, Unicode and Encoding

Implementing Unicode Normalization

How do I handle Unicode normalization in C++?

Unicode normalization is the process of converting strings to a standard form, ensuring that equivalent strings have a unique binary representation.
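To make the problem concrete, here is a minimal standard-library-only sketch showing that the two common spellings of "é" are different byte sequences in UTF-8, so a plain std::string comparison reports them as unequal. The helper toHex() and the function byteComparisonExample() are illustrative names introduced here, not part of any library.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Render each byte of a string as two uppercase hex digits,
// separated by spaces (illustrative helper, not a library function).
std::string toHex(const std::string& bytes) {
  std::string out;
  char buf[4];
  for (unsigned char b : bytes) {
    std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(b));
    if (!out.empty()) out += ' ';
    out += buf;
  }
  return out;
}

// "é" written two ways. The UTF-8 bytes are spelled out explicitly so
// the result does not depend on the encoding of the source file.
const std::string decomposed = "e\xCC\x81";  // U+0065 U+0301
const std::string precomposed = "\xC3\xA9";  // U+00E9

// The expectations, stated as assertions rather than printed output.
void byteComparisonExample() {
  assert(toHex(decomposed) == "65 CC 81");
  assert(toHex(precomposed) == "C3 A9");
  // std::string compares bytes, so the two spellings are "different".
  assert(decomposed != precomposed);
}
```

Calling byteComparisonExample() from main() runs the checks. Both strings render as the same glyph, yet they differ in length and content at the byte level.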

Normalization is crucial for string comparison and searching, because code that compares raw bytes or code units treats canonically equivalent strings as different. In C++, handling Unicode normalization typically involves a third-party library, as the standard library has no built-in support for it. Here's how you can approach it:

Using the ICU Library

The International Components for Unicode (ICU) library is a comprehensive Unicode library that provides normalization functions. Here's an example of how to use ICU for Unicode normalization:

#include <iostream>
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Convert an ICU string (UTF-16 internally)
// to a UTF-8 std::string for printing
std::string toUTF8(const icu::UnicodeString& str) {
  std::string result;
  str.toUTF8String(result);
  return result;
}

int main() {
  UErrorCode status = U_ZERO_ERROR;

  // Two representations of the same character
  // 'e' followed by combining acute accent
  // (char16_t literals require ICU 59 or later)
  icu::UnicodeString str1(u"e\u0301");

  // 'é' as a single character
  icu::UnicodeString str2(u"\u00E9");

  std::cout << "Before normalization:\n";
  std::cout << "str1: " << toUTF8(str1) << "\n";
  std::cout << "str2: " << toUTF8(str2) << "\n";
  std::cout << "Equal: "
    << (str1 == str2 ? "Yes" : "No")  << "\n\n";

  // Get the shared NFC (Normalization Form
  // Canonical Composition) normalizer. ICU owns
  // this instance, so it must not be deleted
  const icu::Normalizer2* nfc =
      icu::Normalizer2::getNFCInstance(status);
  if (U_FAILURE(status)) {
    std::cerr << "ICU error: "
      << u_errorName(status) << "\n";
    return 1;
  }

  icu::UnicodeString nfc1 =
      nfc->normalize(str1, status);
  icu::UnicodeString nfc2 =
      nfc->normalize(str2, status);
  if (U_FAILURE(status)) {
    std::cerr << "ICU error: "
      << u_errorName(status) << "\n";
    return 1;
  }

  std::cout << "After NFC normalization:\n";
  std::cout << "str1: " << toUTF8(nfc1) << "\n";
  std::cout << "str2: " << toUTF8(nfc2) << "\n";
  std::cout << "Equal: "
    << (nfc1 == nfc2 ? "Yes" : "No") << "\n";
}
Before normalization:
str1: é
str2: é
Equal: No

After NFC normalization:
str1: é
str2: é
Equal: Yes

Normalization Forms

Unicode defines four normalization forms:

  1. NFD (Normalization Form Canonical Decomposition): Characters are decomposed by canonical equivalence, and combining marks are arranged in a canonical order.
  2. NFC (Normalization Form Canonical Composition): Characters are decomposed and then recomposed by canonical equivalence.
  3. NFKD (Normalization Form Compatibility Decomposition): Characters are decomposed by compatibility equivalence, which also replaces characters such as the ligature "ﬁ" (U+FB01) with plain "fi".
  4. NFKC (Normalization Form Compatibility Composition): Characters are decomposed by compatibility equivalence, then recomposed by canonical equivalence.

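NFC is often the most practical for general use, as it gives canonically equivalent strings a single representation while keeping composed characters where possible. It does not unify characters that merely look alike, such as Latin "A" and Cyrillic "А". The compatibility forms (NFKD and NFKC) also discard formatting distinctions such as ligatures, so they are better suited to searching and matching than to storing text.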
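The difference between the forms can be sketched with a deliberately tiny toy model. Everything here is an assumption made for illustration: the three-entry table kToyPairs and the functions toyNFD(), toyNFC() and toyNFKD() are hypothetical names, and they handle only the listed characters. Real normalization also covers canonical reordering of multiple marks, Hangul, composition exclusions and thousands of mappings, which is why a library such as ICU is needed in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One hard-coded canonical pair: a precomposed code point and the
// base letter + combining mark it is canonically equivalent to.
struct CanonicalPair {
  char32_t composed;
  char32_t base;
  char32_t mark;
};

// Toy data only: three entries out of the real Unicode tables.
constexpr CanonicalPair kToyPairs[] = {
    {0x00E9, U'e', 0x0301},  // é = e + combining acute accent
    {0x00F1, U'n', 0x0303},  // ñ = n + combining tilde
    {0x00FC, U'u', 0x0308},  // ü = u + combining diaeresis
};

// Toy NFD: replace each known precomposed code point with base + mark.
std::u32string toyNFD(const std::u32string& in) {
  std::u32string out;
  for (char32_t c : in) {
    bool replaced = false;
    for (const CanonicalPair& p : kToyPairs) {
      if (c == p.composed) {
        out += p.base;
        out += p.mark;
        replaced = true;
        break;
      }
    }
    if (!replaced) out += c;
  }
  return out;
}

// Toy NFC: decompose first, then merge each known base + mark pair.
std::u32string toyNFC(const std::u32string& in) {
  const std::u32string d = toyNFD(in);
  std::u32string out;
  for (std::size_t i = 0; i < d.size(); ++i) {
    bool merged = false;
    if (i + 1 < d.size()) {
      for (const CanonicalPair& p : kToyPairs) {
        if (d[i] == p.base && d[i + 1] == p.mark) {
          out += p.composed;
          ++i;  // skip the mark that was just consumed
          merged = true;
          break;
        }
      }
    }
    if (!merged) out += d[i];
  }
  return out;
}

// Toy NFKD: additionally expand one compatibility character, the
// "fi" ligature U+FB01, which the canonical forms leave untouched.
std::u32string toyNFKD(const std::u32string& in) {
  std::u32string expanded;
  for (char32_t c : in) {
    if (c == 0xFB01) {
      expanded += U"fi";
    } else {
      expanded += c;
    }
  }
  return toyNFD(expanded);
}

// The expectations, stated as assertions rather than printed output.
void toyFormsExample() {
  assert(toyNFD(U"caf\u00E9") == U"cafe\u0301");   // split apart
  assert(toyNFC(U"cafe\u0301") == U"caf\u00E9");   // joined again
  assert(toyNFC(U"\uFB01") == U"\uFB01");          // ligature kept
  assert(toyNFKD(U"\uFB01") == U"fi");             // ligature expanded
}
```

The last two assertions show the practical split: canonical forms preserve the ligature, while compatibility forms replace it, which changes the text and cannot be undone.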

Considerations

  • Performance: Normalization can be computationally expensive. Consider caching normalized strings if you're working with large amounts of text.
  • Storage: Normalized strings may require more or less storage than the original, depending on the normalization form and the original text.
  • Compatibility: Ensure all parts of your system use the same normalization form to avoid inconsistencies.
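The caching suggestion above could be sketched as follows. This is one possible design under stated assumptions, not an ICU facility: NormalizationCache is a hypothetical class, and the normalization step is passed in as a callable so the sketch stays independent of any particular library. In a real program that callable would wrap something like an ICU NFC call; here a stand-in is used.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Memoizes the result of an expensive normalization function.
// Hypothetical helper class for illustration only.
class NormalizationCache {
 public:
  using Normalizer = std::function<std::string(const std::string&)>;

  explicit NormalizationCache(Normalizer normalize)
      : normalize_(std::move(normalize)) {}

  // Returns the normalized form of `raw`, computing it only the
  // first time a given input is seen. The reference stays valid
  // because unordered_map never relocates its elements on insert.
  const std::string& get(const std::string& raw) {
    auto it = cache_.find(raw);
    if (it == cache_.end()) {
      ++misses_;
      it = cache_.emplace(raw, normalize_(raw)).first;
    }
    return it->second;
  }

  // Number of times the underlying normalizer actually ran.
  std::size_t misses() const { return misses_; }

 private:
  Normalizer normalize_;
  std::unordered_map<std::string, std::string> cache_;
  std::size_t misses_ = 0;
};

// Usage with a stand-in normalizer that just wraps its input in
// brackets, so the caching behavior is visible without ICU.
void cacheExample() {
  NormalizationCache cache(
      [](const std::string& s) { return "[" + s + "]"; });
  assert(cache.get("abc") == "[abc]");  // computed
  assert(cache.get("abc") == "[abc]");  // served from the cache
  assert(cache.misses() == 1);
}
```

Using a single cache object as the only route to normalized text also helps with the compatibility point, since every caller then gets the same normalization form. The cache grows without bound, so a long-running program would want an eviction policy.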

Remember, while normalization is crucial for correct string comparisons, it's not a silver bullet for all Unicode-related issues. You'll still need to consider other aspects of internationalization and localization in your C++ programs.

This Question is from the Lesson:

Characters, Unicode and Encoding

An introduction to C++ character types, the Unicode standard, character encoding, and C-style strings

Answers to questions are automatically generated and may not have been reviewed.

Copyright © 2024 - All Rights Reserved