Normalization Forms

A normalization form defines a unique representation for all strings that are equivalent to one another: converting equivalent strings to the same normalization form yields identical sequences of code points. The two types of character equivalence described in Character Equivalence give rise to four normalization forms, as defined by Unicode Standard Annex #15, Unicode Normalization Forms:

http://www.unicode.org/unicode/reports/tr15/
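
As a rough illustration of what a unique representation buys you, the sketch below compares two canonically equivalent spellings of the same character. It uses Python's standard unicodedata module purely as a stand-in for illustration, not the Internationalization Module's own classes; the helper name same_text is hypothetical.

```python
import unicodedata

# Two canonically equivalent spellings of "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after converting both
    to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A raw comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

# After normalization, both strings have the same representation.
normalized_equal = same_text(precomposed, decomposed)

print(raw_equal, normalized_equal)  # prints: False True
```

Any of the four forms would work for this comparison, as long as both strings are converted to the same one.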

The four normalization forms are:

Normalization Form Decomposed (NFD)

Composite characters are replaced by canonically equivalent character sequences, in canonical order. Compatibility characters are unaffected.

Normalization Form Compatibility Decomposed (NFKD)

Composite characters are replaced by canonically equivalent character sequences, in canonical order. Compatibility characters are replaced by their nominal counterparts.

Normalization Form Composed (NFC)

Character sequences are replaced by canonically equivalent composites, where possible. Compatibility characters are unaffected. The W3C generally recommends that strings be interchanged in NFC.

Normalization Form Compatibility Composed (NFKC)

Character sequences are replaced by canonically equivalent composites, where possible. Compatibility characters are replaced by their nominal counterparts.
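
The four definitions above can be compared side by side with a small sketch. It again assumes Python's standard unicodedata module rather than the SourcePro API, and the sample string and the code_points helper are chosen here only for illustration: one composite character (é) followed by one compatibility character (the ligature U+FB01).

```python
import unicodedata

# "é" (a composite character) followed by "ﬁ" (U+FB01, a compatibility ligature).
sample = "\u00e9\ufb01"


def code_points(text):
    """Hypothetical helper: render a string as a list of U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in text]


# Normalize the same input to each of the four forms.
results = {
    form: code_points(unicodedata.normalize(form, sample))
    for form in ("NFD", "NFKD", "NFC", "NFKC")
}

for form, labels in results.items():
    print(form.ljust(4), " ".join(labels))

# NFD  U+0065 U+0301 U+FB01
# NFKD U+0065 U+0301 U+0066 U+0069
# NFC  U+00E9 U+FB01
# NFKC U+00E9 U+0066 U+0069
```

Only the two compatibility forms touch the ligature, and only the two decomposed forms split the accented letter.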

Two of the normalization forms, NFD and NFKD, replace composite characters with their canonical decompositions. The other two forms, NFC and NFKC, perform the opposite operation: they replace sequences of characters with canonical composites, where possible.
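
A short sketch may help with two phrases used above, "in canonical order" and "where possible". It assumes Python's standard unicodedata module as a stand-in; the example characters are chosen here for illustration and do not come from this guide.

```python
import unicodedata

# The same two combining marks in different orders:
# U+0307 COMBINING DOT ABOVE and U+0323 COMBINING DOT BELOW.
above_then_below = "q\u0307\u0323"
below_then_above = "q\u0323\u0307"

# Decomposition puts combining marks into canonical order, so both
# inputs end up as the same sequence (dot below first, then dot above).
nfd_a = unicodedata.normalize("NFD", above_then_below)
nfd_b = unicodedata.normalize("NFD", below_then_above)
same_order = nfd_a == nfd_b == "q\u0323\u0307"

# Composition and decomposition undo each other for "é".
composed = unicodedata.normalize("NFC", "e\u0301")    # "\u00e9"
decomposed = unicodedata.normalize("NFD", composed)   # "e\u0301"

# "Where possible": Unicode has no precomposed "q with dot above",
# so NFC leaves this sequence as two code points.
nfc_q = unicodedata.normalize("NFC", "q\u0307")

print(same_order, len(composed), len(decomposed), len(nfc_q))
# prints: True 1 2 2
```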

Two of the normalization forms, NFD and NFC, do not affect compatibility characters. These forms are non-lossy: a string can be converted to NFD or NFC with no loss of information. The other two forms, NFKD and NFKC, replace compatibility characters with their nominal counterparts. Because a compatibility character may differ in appearance from its nominal counterpart, converting a string to NFKD or NFKC is a lossy operation: the original string cannot, in general, be recovered from the result.
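
The loss of information can be seen in a minimal sketch, again assuming Python's standard unicodedata module as a stand-in for illustration. The sample strings are chosen here and are not taken from this guide.

```python
import unicodedata

# U+00B2 SUPERSCRIPT TWO is a compatibility character for the digit "2".
squared = "x\u00b2"   # reads as "x squared"
plain = "x2"

# NFC leaves the compatibility character alone.
nfc = unicodedata.normalize("NFC", squared)

# NFKC replaces it with its nominal counterpart.
nfkc = unicodedata.normalize("NFKC", squared)

# Two strings that were distinct before conversion are identical after it,
# so the superscript cannot be recovered from the NFKC result.
distinct_before = squared != plain
identical_after = nfkc == unicodedata.normalize("NFKC", plain)

print(nfc == squared, nfkc, distinct_before, identical_after)
# prints: True x2 True True
```

This is why the compatibility forms suit tasks such as searching and loose matching, while NFD or NFC are the safer choice when the original text must be preserved.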