Normalization



Introduction

A piece of text can sometimes be represented by more than one sequence of Unicode code points. The Unicode standard recognizes two types of character equivalence, canonical equivalence and compatibility equivalence, in which different code points or sequences of code points are considered equivalent forms of the same information. These two types of equivalence give rise to four normalization forms: NFD, NFC, NFKD, and NFKC. Each normalization form produces a unique representation for a given string.

Normalization is the process of converting Unicode text to a unique representation. Normalization facilitates sorting, searching, conversion, and data exchange. The W3C recommends that all data be normalized as early as possible.
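To make the idea concrete, the following is a minimal, self-contained sketch of why equivalent text can fail a plain comparison until it is brought to one representation. It does not use RWUNormalizer or any Internationalization Module API. The names ComposePair, kPairs, composeToy, and isComposedToy are hypothetical, and the three-entry table is an assumption chosen for illustration. A real normalizer relies on the full Unicode data tables and also handles decomposition and the ordering of combining marks, which this toy omits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical toy table: a base character, a combining mark, and the
// precomposed code point that is canonically equivalent to the pair.
struct ComposePair {
    char32_t base;
    char32_t mark;
    char32_t composed;
};

static const ComposePair kPairs[] = {
    {0x0065, 0x0301, 0x00E9},  // e + combining acute accent -> U+00E9
    {0x0061, 0x0300, 0x00E0},  // a + combining grave accent -> U+00E0
    {0x006F, 0x0308, 0x00F6},  // o + combining diaeresis    -> U+00F6
};

// Replaces each (base, mark) pair found in kPairs with its precomposed
// code point. This is NOT a full NFC implementation; it only covers the
// pairs listed above and performs no decomposition or reordering.
std::u32string composeToy(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (i + 1 < in.size()) {
            for (const ComposePair& p : kPairs) {
                if (p.base == c && p.mark == in[i + 1]) {
                    c = p.composed;
                    ++i;  // consume the combining mark as well
                    break;
                }
            }
        }
        out.push_back(c);
    }
    return out;
}

// Reports whether the toy composition would leave the string unchanged.
bool isComposedToy(const std::u32string& s) {
    return composeToy(s) == s;
}
```

With these definitions, the one-code-point string U+00E9 and the two-code-point string U+0065 U+0301 compare unequal as raw sequences, yet composeToy maps both to the same result, so a search or comparison performed after conversion treats them as the same text.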

In the Internationalization Module, class RWUNormalizer normalizes Unicode text. This chapter describes how to use RWUNormalizer to:

convert a string into a particular normalization form

detect whether a string is already in a particular form