Unicode Character Encoding Forms

Any character in the Unicode character set can be expressed using 21 bits, because code points range from 0x0 to 0x10FFFF. The Unicode Standard defines three character encoding forms for representing each 21-bit code point in memory:

UTF-8

Each 21-bit code point is represented using one to four 8-bit code units.

UTF-16

Each 21-bit code point is represented using one or two 16-bit code units.

UTF-32

Each 21-bit code point is represented using a single 32-bit code unit.
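
As an illustration of the three definitions above, the following sketch counts how many code units a single code point occupies in each encoding form. It uses only standard C++ and is not part of the Internationalization Module API; the names CodeUnitCounts and countCodeUnits are hypothetical, chosen for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: code units needed in each encoding form.
struct CodeUnitCounts {
    int utf8;   // number of 8-bit code units
    int utf16;  // number of 16-bit code units
    int utf32;  // number of 32-bit code units
};

// Assumes cp is a valid code point for a character: at most 0x10FFFF
// and outside the surrogate range 0xD800 to 0xDFFF.
inline CodeUnitCounts countCodeUnits(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);

    CodeUnitCounts n;
    n.utf8  = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    n.utf16 = cp < 0x10000 ? 1 : 2;
    n.utf32 = 1;  // always a single 32-bit code unit
    return n;
}
```

For example, countCodeUnits(0x20AC) (the euro sign) reports three UTF-8 code units, one UTF-16 code unit, and one UTF-32 code unit, while countCodeUnits(0x1F600) reports four, two, and one.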

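The UTF-16 encoding form strikes a balance between ease of use and efficient use of memory. Characters in the range 0x0000 to 0xFFFF, which covers most characters in common use, can be represented with a single 16-bit code unit. Only characters in the range 0x10000 to 0x10FFFF must be represented with a surrogate pair of two 16-bit code units: a high surrogate in the range 0xD800 to 0xDBFF followed by a low surrogate in the range 0xDC00 to 0xDFFF.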
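
The surrogate pair arithmetic can be sketched in standard C++ as shown below. This is a minimal illustration of the UTF-16 encoding form, not the module's own implementation; appendUtf16 and combineSurrogates are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <string>

// Append the UTF-16 code units for one code point to out.
// Assumes cp is at most 0x10FFFF and is not itself a surrogate.
inline void appendUtf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);

    if (cp < 0x10000) {
        // One code unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Subtract 0x10000 to get a 20-bit value, then split it
        // into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}

// Combine a high and a low surrogate back into one code point.
inline char32_t combineSurrogates(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);

    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

Appending the code point 0x1F600 produces the two code units 0xD83D and 0xDE00, and passing those two values to combineSurrogates yields 0x1F600 again.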

The Internationalization Module uses UTF-16 for the internal representation and manipulation of multilingual text.