Normalization
Introduction
A piece of text can sometimes be represented by more than one sequence of Unicode characters. This is because the Unicode standard recognizes two types of character equivalence, canonical equivalence and compatibility equivalence, in which different Unicode code points or sequences of code points are considered equivalent forms of the same information. The two types of character equivalence give rise to four normalization forms: NFD, NFC, NFKD, and NFKC. Each normalization form produces a unique representation for a given string.
Normalization is the process of converting Unicode text to a unique representation. Normalization facilitates sorting, searching, conversion, and data exchange. The W3C recommends that all data be normalized as early as possible.
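To make the idea of equivalent sequences concrete, the following is a minimal, self-contained sketch in standard C++. It does not use RWUNormalizer or any other part of the Internationalization Module. The names precomposed(), decomposed(), and toyCompose() are hypothetical helpers invented for this illustration, and toyCompose() handles only a single pair of code points, so it is not a real implementation of any normalization form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Code points used in this illustration.
const char32_t kLatinSmallE      = 0x0065; // LATIN SMALL LETTER E
const char32_t kCombiningAcute   = 0x0301; // COMBINING ACUTE ACCENT
const char32_t kLatinSmallEAcute = 0x00E9; // LATIN SMALL LETTER E WITH ACUTE

// Hypothetical helper: the accented letter as one precomposed code point.
std::u32string precomposed() {
    return std::u32string(1, kLatinSmallEAcute);
}

// Hypothetical helper: the same letter as a base letter plus a combining mark.
std::u32string decomposed() {
    std::u32string s;
    s.push_back(kLatinSmallE);
    s.push_back(kCombiningAcute);
    return s;
}

// Toy composer (assumption: illustration only). It replaces the sequence
// U+0065 U+0301 with U+00E9 and copies every other code point unchanged.
// Real normalization covers the full Unicode repertoire and also reorders
// combining marks, which this function does not attempt.
std::u32string toyCompose(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == kLatinSmallE && i + 1 < in.size() &&
            in[i + 1] == kCombiningAcute) {
            out.push_back(kLatinSmallEAcute);
            ++i; // skip the combining mark that was just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

A plain comparison treats precomposed() and decomposed() as different strings, even though a reader sees the same letter. After both are passed through toyCompose(), they compare equal. This is the effect that normalization provides in general, and it is why normalized text is easier to sort, search, and exchange.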
In the Internationalization Module, class RWUNormalizer normalizes Unicode text. This chapter describes how to use RWUNormalizer to:
convert a string into a particular normalization form
detect whether a string is already in a particular form