Character Encoding Forms

To represent characters in a computer, each code point in a coded character set must be mapped to a sequence of bits. This mapping is called a character encoding form.

A code unit is the fundamental binary width used in a computer architecture for representing character data, such as 7 bits, 8 bits, 16 bits, or 32 bits. Depending on the character encoding form used, each code point in a coded character set may be represented internally by one or more such code units.
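The relationship between a code point and its code units can be sketched in a few lines. The example below uses Python purely for illustration; it is not part of SourcePro, and the helper name `code_units` is hypothetical. It splits the encoded bytes of one character into code units of a given size, assuming big-endian byte order:

```python
def code_units(ch, encoding, unit_bytes):
    """Return the code units that represent one character, as integers.

    unit_bytes is the code unit size in bytes (1, 2, or 4). The encoding
    name is expected to be a big-endian Python codec such as "utf-16-be".
    """
    data = ch.encode(encoding)
    return [int.from_bytes(data[i:i + unit_bytes], "big")
            for i in range(0, len(data), unit_bytes)]

euro = "\u20ac"  # U+20AC EURO SIGN

print([hex(u) for u in code_units(euro, "utf-8", 1)])      # three 8-bit code units
print([hex(u) for u in code_units(euro, "utf-16-be", 2)])  # one 16-bit code unit
print([hex(u) for u in code_units(euro, "utf-32-be", 4)])  # one 32-bit code unit
```

The same code point therefore occupies a different number of code units depending on the encoding form chosen.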

A character encoding form whose code unit sequences are all of the same length is known as a fixed width encoding. For example, single-byte character sets (SBCS) are fixed width. If a double-byte character set (DBCS) always uses two code units to represent a code point, then it is also fixed width.

A character encoding form whose sequences are not all of the same length is known as a variable width encoding. If a double-byte character set uses one or two code units to represent a code point, then it is a variable width encoding. Multibyte character sets (MBCS) are variable width.
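The distinction between fixed width and variable width can be observed by measuring how many code units each character in a sample needs. The following is a minimal sketch in Python, used here only for illustration; the function names are hypothetical, and the check covers only the characters in the sample, so it can show that an encoding is variable width but cannot prove that one is fixed width:

```python
def sequence_lengths(text, encoding, unit_bytes=1):
    """Return the set of code unit sequence lengths used by the characters in text."""
    return {len(ch.encode(encoding)) // unit_bytes for ch in text}

def looks_fixed_width(text, encoding, unit_bytes=1):
    """True if every character in text uses the same number of code units."""
    return len(sequence_lengths(text, encoding, unit_bytes)) == 1

# Single-byte character set: every character is one 8-bit code unit.
print(looks_fixed_width("A\u00e9", "latin-1"))            # True

# Multibyte character set: "A" takes one byte, HIRAGANA LETTER A takes two.
print(sequence_lengths("A\u3042", "shift_jis"))

# UTF-16 needs two 16-bit code units for characters beyond U+FFFF.
print(sequence_lengths("A\U0001F600", "utf-16-be", 2))
```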

Examples of character encoding forms include:

US ASCII, a 7-bit fixed width encoding form

ISO 8859-1, an 8-bit fixed width encoding form

CP 037 and CP 500, 8-bit fixed width EBCDIC encoding forms

Windows CP 1252, an 8-bit fixed width encoding form

Shift-JIS, a variable width encoding form for JIS X 0208 that represents each character with one or two 8-bit code units

UTF-8, an 8-bit variable width encoding form for Unicode 3.0

UTF-16, a 16-bit variable width encoding form for Unicode 3.0

UTF-32, a 32-bit fixed width encoding form for Unicode 3.0
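To compare the encoding forms listed above, the same character can be encoded with each of them. The sketch below uses Python's built-in codecs for illustration only; the mapping from the names in this list to Python codec names is an assumption, the `-be` (big-endian) variants are chosen so that no byte order mark is emitted, and `encode_all` is a hypothetical helper:

```python
# Assumed correspondence between the encoding forms above and Python codec names.
CODECS = {
    "US ASCII": "ascii",
    "ISO 8859-1": "latin-1",
    "CP 037": "cp037",
    "CP 500": "cp500",
    "Windows CP 1252": "cp1252",
    "Shift-JIS": "shift_jis",
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-be",
    "UTF-32": "utf-32-be",
}

def encode_all(ch):
    """Encode one character with every codec; None marks 'not representable'."""
    result = {}
    for name, codec in CODECS.items():
        try:
            result[name] = ch.encode(codec).hex()
        except UnicodeEncodeError:
            result[name] = None
    return result

for name, hex_bytes in encode_all("A").items():
    print(f"{name:16} {hex_bytes}")
```

For the letter A, the ASCII-compatible encodings produce the byte 41, the two EBCDIC code pages produce c1, and UTF-16 and UTF-32 pad the same value to two and four bytes respectively. A character outside an encoding's repertoire, such as the euro sign in US ASCII, is reported as `None`.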