Collation


Overview

Collation is the process of comparing strings to determine equality and order. As described in Chapter 3, you can use the compareTo() method and the global comparison operators on RWUString to perform lexical comparisons. Lexical comparisons are fast. If two strings contain characters from the same script and are in the same normalization form (see Chapter 5), a lexical comparison may be adequate for many purposes.
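As a rough illustration of what a lexical comparison does and where it falls short, the following sketch compares strings code point by code point. It uses the standard std::u32string rather than RWUString so that it stands alone, and the function names are illustrative only, not part of the Internationalization Module.

```cpp
#include <cassert>
#include <string>

// Lexical "less than": compares code point values only.
// No locale rules and no normalization are applied.
bool lexicalLess(const std::u32string& a, const std::u32string& b) {
    return a < b;
}

// Lexical equality: true only if the code point sequences are identical.
bool lexicalEqual(const std::u32string& a, const std::u32string& b) {
    return a == b;
}

void lexicalComparisonDemo() {
    // 'Z' (U+005A) has a smaller code point than 'a' (U+0061),
    // so "Zebra" orders before "apple".
    assert(lexicalLess(U"Zebra", U"apple"));

    // Precomposed U+00E9 and 'e' followed by combining acute U+0301
    // render the same but are different code point sequences.
    assert(!lexicalEqual(U"\u00E9", U"e\u0301"));
}
```

Both results are correct as lexical comparisons, yet neither is what most readers would expect from a sorted list or an equality test on visible text.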

However, lexical comparisons are unlikely to match an end user’s expectations regarding string equality and ordering, because each language has its own rules for determining the proper collation order for strings. For instance:

The letters A-Z are sorted differently in different languages. For example, in Lithuanian, y is sorted between i and k. In Swedish, v and w have traditionally been treated as variant forms of the same letter.

In some languages, combinations of letters are treated as if they were one letter, while in other languages single letters are treated as if they were two letters. For example, in traditional Spanish ch is treated as a single letter, and is sorted between c and d, while in traditional German ä is compared as if it were ae.

In some languages, accented letters are treated as distinct letters. For example, in Danish Å is a distinct letter that sorts following Z. In other languages, accented letters are just minor variants of unaccented letters.

In some languages, such as English, lowercase letters are usually sorted before uppercase letters. In other languages, such as Latvian, the reverse is true.

These are just a few examples of how languages can vary in ordering strings. Furthermore, collation rules in a particular locale can change over time due to government regulations or the addition of new characters or scripts to Unicode.
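To make one of these rules concrete, here is a toy sketch of the traditional Spanish treatment of ch as a single letter between c and d. It assumes lowercase ASCII input only, and the weighting scheme and function names are invented for illustration. It is not how RWUCollator or the Unicode Collation Algorithm is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy weights (assumes lowercase ASCII input):
// each letter gets an even weight 2 * (letter - 'a'), and the digraph
// "ch" gets the odd weight that falls between 'c' and 'd'.
std::vector<int> traditionalSpanishWeights(const std::string& s) {
    std::vector<int> weights;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == 'c' && i + 1 < s.size() && s[i + 1] == 'h') {
            weights.push_back(2 * ('c' - 'a') + 1);  // "ch" as one letter
            ++i;                                     // skip the 'h'
        } else {
            weights.push_back(2 * (s[i] - 'a'));
        }
    }
    return weights;
}

// Orders two words by their weight sequences instead of raw characters.
bool traditionalSpanishLess(const std::string& a, const std::string& b) {
    return traditionalSpanishWeights(a) < traditionalSpanishWeights(b);
}
```

With these weights, "cuna" orders before "chico", and "chico" orders before "dama", whereas a plain lexical comparison places "chico" before "cuna".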

Therefore, the Internationalization Module includes collation classes that support locale-sensitive string comparisons. By taking into account the collation order used in a particular locale, these classes make it possible to sort strings in accordance with the conventions of that locale:

RWUCollator performs locale-sensitive string comparison in accordance with the Unicode Collation Algorithm. The collation rules are highly customizable. For example, you can specify whether differences in case or punctuation should be considered significant.

RWUCollationKey contains preprocessed comparison information. Generated by RWUCollator, RWUCollationKey objects can be used to speed repeated string comparisons; for example, when sorting a set of strings.
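The idea behind collation keys can be sketched in standard C++: compute a key once per string, then sort on the keys so that each comparison is a cheap key comparison. The key function below is a hypothetical stand-in (ASCII case folding) and the names are illustrative. A real RWUCollationKey is generated by an RWUCollator and encodes locale-specific ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a collation key: the ASCII case-folded text.
std::string makeToyKey(const std::string& s) {
    std::string key;
    key.reserve(s.size());
    for (unsigned char c : s) {
        key.push_back(static_cast<char>(std::tolower(c)));
    }
    return key;
}

// Computes each key once, sorts on (key, original), and returns the
// original strings in key order.
std::vector<std::string> sortWithKeys(const std::vector<std::string>& input) {
    std::vector<std::pair<std::string, std::string>> keyed;
    keyed.reserve(input.size());
    for (const std::string& s : input) {
        keyed.emplace_back(makeToyKey(s), s);
    }
    // Pairs compare by key first; the original string breaks ties.
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::string> sorted;
    sorted.reserve(keyed.size());
    for (auto& entry : keyed) {
        sorted.push_back(std::move(entry.second));
    }
    return sorted;
}
```

Sorting n strings takes on the order of n log n comparisons, so paying the key-generation cost once per string tends to be worthwhile when the underlying comparison is expensive.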

This chapter describes:

how to perform locale-sensitive string comparisons with RWUCollator

how to customize an RWUCollator

how to use collation keys to speed repeated string comparisons