Localizing Alphabets with RWCString and RWWString


Because 8 bits are often not enough to represent all the character glyphs of various languages, the Essential Tools Module supports two kinds of extended encodings: multibyte and wide-character.

Multibyte encodings use a sequence of one or more bytes to represent a single character. (Typically the ASCII characters are still one byte long.) These encodings are compact, but they can be inconvenient for indexing and substring operations, because a character's position in the string no longer corresponds to a fixed byte offset. Wide-character encodings, in contrast, place each character in a 16- or 32-bit integral type called wchar_t, and represent a string as an array of wchar_t. Usually it is possible to translate a string encoded in one form into the other.
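The indexing difficulty can be sketched in a few lines of standard C++. This example is illustrative only: it assumes the multibyte encoding is UTF-8 (the encoding actually in use is that of the host system), and the helper names countUtf8Chars and byteOffsetOfChar are hypothetical, not part of the Essential Tools Module.

```cpp
#include <cstddef>
#include <string>

// Assumption: `s` holds valid UTF-8. In UTF-8 every continuation byte
// has the bit pattern 10xxxxxx, so counting the bytes that do NOT match
// that pattern counts the characters.
std::size_t countUtf8Chars(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Returns the byte offset at which character number `index` (0-based)
// starts, or std::string::npos if the string has fewer characters.
// The linear scan is the point: a multibyte string offers no
// constant-time access to its n-th character.
std::size_t byteOffsetOfChar(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (seen == index) {
                return i;
            }
            ++seen;
        }
    }
    return std::string::npos;
}
```

For the UTF-8 string "na\xC3\xAF" "ve" (the word "naive" with a diaeresis on the i), the string occupies 6 bytes but countUtf8Chars reports 5 characters, and the character at index 3 starts at byte offset 4. In an array of wchar_t wide enough to hold each character in one element, the same lookup is a plain subscript.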

The two efficient string types in the Essential Tools Module, RWCString and RWWString, were discussed in Chapter 3. RWCString represents strings of 8-bit chars, with some support for multibyte strings. RWWString represents strings of wchar_t. Both provide access to Standard C Library support for local collation conventions with the member function collate() and the global function strXForm(). In addition, the library provides conversions between wide and multibyte representations. The wide- and multibyte-character encodings used are those of the host system.
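The Standard C Library facilities referred to above can be sketched directly. The example below uses only standard functions (std::strcoll, std::strxfrm, std::mbstowcs) and std::string; it does not use RWCString or RWWString. The helper names collateCompare, sortKey, and toWide are hypothetical, and the correspondence to collate(), strXForm(), and the library's conversions is an assumption based on the description in the text.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Compares two strings using the collation rules of the current
// C locale (LC_COLLATE). Returns <0, 0, or >0, like strcmp.
int collateCompare(const std::string& a, const std::string& b) {
    return std::strcoll(a.c_str(), b.c_str());
}

// Builds a sort key: comparing two keys byte by byte gives the same
// order as collateCompare on the original strings. Useful when the
// same strings are compared many times, as in a sort.
std::string sortKey(const std::string& s) {
    // With a zero-length destination, strxfrm only reports the key length.
    std::size_t needed = std::strxfrm(nullptr, s.c_str(), 0);
    std::vector<char> buffer(needed + 1);
    std::strxfrm(buffer.data(), s.c_str(), buffer.size());
    return std::string(buffer.data());
}

// Converts a multibyte string to wide characters using the current
// C locale (LC_CTYPE). Returns false if `s` contains an invalid
// multibyte sequence. A wide string never has more elements than the
// multibyte string has bytes, so s.size() + 1 elements always suffice.
bool toWide(const std::string& s, std::wstring& out) {
    std::vector<wchar_t> buffer(s.size() + 1);
    std::size_t written =
        std::mbstowcs(buffer.data(), s.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return false;
    }
    out.assign(buffer.data(), written);
    return true;
}
```

A program starts in the "C" locale, where strcoll behaves like strcmp and the multibyte encoding is one byte per character. Calling std::setlocale(LC_ALL, "") first selects the user's locale, after which the same three functions follow the local collation and encoding conventions. All three helpers stop at the first embedded null byte, because they pass c_str() to C functions.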

But the representation of alphabets can be even more complex. For example, is a character upper case, lower case, or neither? In a sorted list, where do you put the names that begin with accented letters? What about Cyrillic names? How are wide-character strings represented on byte streams? Standards bodies and corporate labs are discussing these issues, but the results are not yet portable. For the time being, the Essential Tools Module strives to make the best use of the facilities that are available.