An Introduction to Unicode

This chapter refers to the Unicode encoding forms UTF-8 and UTF-16, which the SourcePro modules use internally to store and manipulate text. This section provides an overview of these terms.

The Unicode Standard encodes the characters used in nearly all of the world's written languages. It defines three main encoding forms: UTF-8, UTF-16, and UTF-32.

The Advanced Tools Module offers conversion of strings and streams to and from UTF-8 and UTF-16. The Internationalization Module offers conversions between UTF-16 and any other character encoding.

Each encoding form serves a different purpose, so a programmer can select the form that best fits the application’s requirements and the system’s memory constraints.

UTF-8 uses 8-bit code units and represents each Unicode code point (a value of up to 21 bits) with one to four code units. Because each ASCII character occupies a single code unit with the same value it has in ASCII, this form is backward compatible with ASCII-based APIs and protocols, and it is the likely choice when the required character set is US-ASCII or non-Asian.
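To make the one-to-four code unit rule concrete, the following is a minimal sketch in standard C++ of how a single code point maps to UTF-8 code units. The function name encodeUtf8 is hypothetical and is not part of SourcePro; production code would normally rely on the conversion facilities of the Advanced Tools Module rather than a hand-written encoder.

```cpp
#include <cassert>
#include <string>

// Illustrative only: encodeUtf8 is a hypothetical helper, not a SourcePro API.
// Encodes one Unicode scalar value as one to four 8-bit UTF-8 code units.
// Precondition: cp <= 0x10FFFF and cp is not a surrogate (0xD800..0xDFFF).
std::string encodeUtf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        // U+0000..U+007F: one code unit, identical to ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // U+0080..U+07FF: two code units.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // U+0800..U+FFFF: three code units.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // U+10000..U+10FFFF: four code units.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Under these assumptions, encodeUtf8(0x41) yields the single byte 0x41, the same value the letter A has in ASCII, while encodeUtf8(0x20AC) (the euro sign) yields the three bytes 0xE2 0x82 0xAC.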

UTF-16 uses 16-bit code units and represents each Unicode code point with either one or two code units. It is the encoding form used by the RWUString class of the Internationalization Module. UTF-16 is usually a good choice for most Asian character sets, because the commonly used characters of those scripts fit in a single 16-bit code unit, whereas UTF-8 needs three 8-bit code units for each of them.
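The one-or-two code unit rule can be sketched in standard C++ as follows. The function name encodeUtf16 is hypothetical and is not part of SourcePro or of RWUString; it only illustrates how a code point above U+FFFF is split into a pair of 16-bit code units (a surrogate pair).

```cpp
#include <cassert>
#include <string>

// Illustrative only: encodeUtf16 is a hypothetical helper, not a SourcePro API.
// Encodes one Unicode scalar value as one or two 16-bit UTF-16 code units.
// Precondition: cp <= 0x10FFFF and cp is not a surrogate (0xD800..0xDFFF).
std::u16string encodeUtf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // U+0000..U+FFFF: the code point is stored directly in one code unit.
        out += static_cast<char16_t>(cp);
    } else {
        // U+10000..U+10FFFF: subtract 0x10000, then split the remaining
        // 20 bits into a high surrogate and a low surrogate.
        char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));
    }
    return out;
}
```

Under these assumptions, encodeUtf16(0x4E2D) (a common CJK ideograph) yields the single code unit 0x4E2D, which occupies two bytes, while encodeUtf16(0x1F600) yields the two code units 0xD83D and 0xDE00.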