Performs locale-sensitive string comparison for use in searching and sorting natural language text.
#include <rw/i18n/RWUCollator.h>
Public Types | |
enum | CaseOrder { Normal, LowerFirst, UpperFirst } |
enum | CollationStrength { Primary, Secondary, Tertiary, Quaternary, Identical } |
Public Member Functions | |
RWUCollator (const RWULocale &locale=RWULocale::getDefault()) | |
RWUCollator (const RWUCollator &original) | |
~RWUCollator (void) | |
int | compareTo (const RWUString &lhs, const RWUString &rhs) const |
void | enableCaseLevel (bool caseLevel) |
void | enableFrenchCollation (bool frenchCollation) |
void | enableNormalizationChecking (bool check) |
void | enablePunctuationShifting (bool shift) |
bool | equals (const RWUString &lhs, const RWUString &rhs) const |
CaseOrder | getCaseOrder (void) const |
RWUCollationKey | getCollationKey (const RWUString &str) const |
RWULocale | getLocale (void) const |
CollationStrength | getStrength (void) const |
bool | isEnabledCaseLevel (void) const |
bool | isEnabledFrenchCollation (void) const |
bool | isEnabledNormalizationChecking (void) const |
bool | isEnabledPunctuationShifting (void) const |
RWUCollator & | operator= (const RWUCollator &rhs) |
void | setCaseOrder (CaseOrder order) |
void | setStrength (CollationStrength strength) |
RWUCollator performs locale-sensitive string comparison for use in searching and sorting natural language text.
Each language has its own rules for determining the proper collation order for strings. For example, in Lithuanian, the letter y appears between i and k in the alphabet. To take language-specific conventions into account, each RWUCollator is associated with an RWULocale at construction time. This locale specifies the default values for a variety of RWUCollator attributes. Many of these default values can be overridden using attribute mutator methods.
RWUCollator follows the Unicode Collation Algorithm, as described in Unicode Technical Standard #10: http://www.unicode.org/reports/tr10/.
This collation algorithm can be customized using the attribute mutator methods of the RWUCollator class. With these methods, you can specify how collation elements are found, how collation weights are formed, and which collation levels should be considered significant. See the Internationalization Module User's Guide for more information on collation.
RWUCollator calculates collation weights incrementally. This ensures good performance, as most strings differ in their first few characters. However, if string comparisons are to be made repeatedly (for example, when sorting a set of strings), then best performance can be achieved by obtaining an RWUCollationKey for each string and comparing the keys. Generating a key via RWUCollator::getCollationKey() is a non-trivial operation, as it involves determining the collation elements and weights for an entire string. Comparing two RWUCollationKey objects, however, is fast.
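The key-based approach described above can be sketched with the standard library alone. This is only an illustration of the pattern, not the RWUCollator implementation: `makeKey()` is a hypothetical stand-in for RWUCollator::getCollationKey() that lowercases ASCII text, and `sortByKey()` is an invented helper name. The point is that each key is built once per string, and the sort then compares only keys.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for RWUCollator::getCollationKey(): builds a
// case-insensitive key for ASCII text. A real collation key encodes the
// collation weights of the entire string.
std::string makeKey(const std::string& s) {
    std::string key;
    key.reserve(s.size());
    for (unsigned char c : s) {
        key.push_back(static_cast<char>(std::tolower(c)));
    }
    return key;
}

typedef std::pair<std::string, std::string> KeyedString;  // (key, original)

// Computes each key exactly once, then sorts by comparing keys only.
std::vector<std::string> sortByKey(const std::vector<std::string>& input) {
    std::vector<KeyedString> keyed;
    keyed.reserve(input.size());
    for (const std::string& s : input) {
        keyed.emplace_back(makeKey(s), s);
    }
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const KeyedString& a, const KeyedString& b) {
                         return a.first < b.first;
                     });
    std::vector<std::string> sorted;
    sorted.reserve(keyed.size());
    for (KeyedString& entry : keyed) {
        sorted.push_back(std::move(entry.second));
    }
    return sorted;
}
```

A sort performs on the order of n log n comparisons, so paying the key-generation cost n times and keeping each comparison cheap is usually the better trade for large inputs.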
A CaseOrder value determines how characters are ordered at the tertiary level or, if enabled, the case level.
A CollationStrength value indicates the level at which two collation elements should be considered equal.
Enumerator | |
---|---|
Primary | only primary differences are considered significant. Primary differences are locale-dependent, but are typically differences in basic character identity. An example of a primary difference is |
Secondary | both primary and secondary differences are considered significant. Secondary differences are locale-dependent, but are typically differences in diacritics. An example of a secondary difference is |
Tertiary | primary, secondary, and tertiary differences are considered significant. Tertiary differences are locale-dependent, but are typically differences in appearance, such as the differences between uppercase, lowercase, superscript, subscript, halfwidth, and circled versions of a character. An example of a tertiary difference is |
Quaternary | primary, secondary, tertiary, and quaternary differences are considered significant. Quaternary strength is useful only in two situations: |
Identical | all differences are considered significant. This strength level should be used sparingly. It rarely distinguishes between strings considered equal at the quaternary level, yet exacts a significant performance cost. |
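The way a strength setting limits which differences count can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not how RWUCollator is implemented: the `Element` struct, its invented integer weights, and the three-value `Strength` enum are hypothetical, and the quaternary and identical levels are omitted. Each level is consulted only if every stronger level ties, and comparison stops at the requested strength.

```cpp
#include <cstddef>
#include <vector>

// Toy collation element with one weight per level. The weights are invented
// for illustration and are not the weights RWUCollator uses.
struct Element {
    int primary;    // base character identity
    int secondary;  // diacritic
    int tertiary;   // case or other variant
};

enum Strength { Primary = 1, Secondary = 2, Tertiary = 3 };

// Compares a single level across both sequences; returns -1, 0, or 1.
int compareLevel(const std::vector<Element>& a, const std::vector<Element>& b,
                 int Element::*level) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i].*level != b[i].*level) {
            return a[i].*level < b[i].*level ? -1 : 1;
        }
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}

// A weaker level is consulted only when all stronger levels tie, and only
// if the requested strength includes it.
int compare(const std::vector<Element>& a, const std::vector<Element>& b,
            Strength strength) {
    int r = compareLevel(a, b, &Element::primary);
    if (r != 0 || strength == Primary) return r;
    r = compareLevel(a, b, &Element::secondary);
    if (r != 0 || strength == Secondary) return r;
    return compareLevel(a, b, &Element::tertiary);
}
```

In this model, two sequences that differ only in a secondary weight compare equal at `Primary` and unequal at `Secondary`, which mirrors the enumerator descriptions above.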
RWUCollator::RWUCollator | ( | const RWULocale & | locale = RWULocale::getDefault() | ) |
Constructs a new RWUCollator based on the given locale. Throws RWUException if any error occurs during the construction.
RWUCollator::RWUCollator | ( | const RWUCollator & | original | ) |
Copy constructor. Makes self a deep copy of original. Throws RWUException if any error occurs during the construction.
RWUCollator::~RWUCollator | ( | void | ) | inline |
Destructor.
int RWUCollator::compareTo | ( | const RWUString & | lhs, const RWUString & | rhs | ) | const |
Compares the given strings, according to the dictates of this collator's attributes. Returns -1 if lhs < rhs, 0 if lhs == rhs, and 1 if lhs > rhs.
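A three-way result of this kind is typically adapted to a sort predicate by testing for a negative value. The sketch below shows the adaptation using only the standard library; `threeWayCompare()` is a hypothetical stand-in that follows the same -1 / 0 / 1 convention with a plain code-unit comparison, and `sortAscending()` is an invented helper name.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for RWUCollator::compareTo(): a plain code-unit
// comparison that follows the same -1 / 0 / 1 return convention.
int threeWayCompare(const std::string& lhs, const std::string& rhs) {
    const int r = lhs.compare(rhs);
    return r < 0 ? -1 : (r > 0 ? 1 : 0);
}

// Sorts in ascending order using only the three-way result.
void sortAscending(std::vector<std::string>& items) {
    std::sort(items.begin(), items.end(),
              [](const std::string& a, const std::string& b) {
                  return threeWayCompare(a, b) < 0;  // strict "less than"
              });
}
```

The predicate tests `< 0` rather than `<= 0` because std::sort requires a strict weak ordering, in which equal elements must not compare as less than one another.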
void RWUCollator::enableCaseLevel | ( | bool | caseLevel | ) |
Sets whether case distinctions should be made at an extra "case level," positioned between the secondary and tertiary levels:
At the case level, cased characters are ordered according to self's CaseOrder attribute.
void RWUCollator::enableFrenchCollation | ( | bool | frenchCollation | ) |
Sets whether French collation rules should be in effect for self.
When French collation rules are in effect, the diacritical differences at the secondary strength level are compared in reverse order, from the end of each string to its start.
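The effect of reversing the secondary comparison can be seen in a toy model. This sketch is illustrative only and makes several assumptions: the `Word` struct and its per-letter accent weights are invented, and only primary and secondary levels are modeled. With base letters "cote", the accent weights {0,0,0,1} stand for "coté" and {0,1,0,0} for "côte".

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy word: base letters plus one accent weight per letter (0 = no accent).
// The weights are invented for illustration.
struct Word {
    std::string base;
    std::vector<int> accents;
};

// Returns -1, 0, or 1. Base letters are compared first; accents break ties
// and are scanned from the end of the string when french is true.
int compare(const Word& a, const Word& b, bool french) {
    const int p = a.base.compare(b.base);
    if (p != 0) return p < 0 ? -1 : 1;
    std::vector<int> x = a.accents;
    std::vector<int> y = b.accents;
    if (french) {
        std::reverse(x.begin(), x.end());
        std::reverse(y.begin(), y.end());
    }
    if (x == y) return 0;
    return x < y ? -1 : 1;
}
```

In this model, "coté" sorts before "côte" when accents are scanned from the start, because the first accent difference favors the unaccented "o". Scanning from the end reverses the result, because the last accent difference favors the unaccented "e".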
void RWUCollator::enableNormalizationChecking | ( | bool | check | ) |
Sets whether self should perform normalization checks on input strings.
When normalization checking is disabled, self correctly compares strings that are in FCD (Fast C or D) form; that is, strings whose raw, recursive decomposition (without reordering of diacritics) results in a canonically ordered string. Most strings in many languages are in FCD form.
In contrast, normalization checking is enabled by default for languages that use multiple combining characters, such as Arabic, Hebrew, Hindi, Thai, and Vietnamese. This ensures that input strings are normalized if necessary before collation. If, however, you know your strings are already in FCD form, you can improve performance slightly by disabling normalization checking.
void RWUCollator::enablePunctuationShifting | ( | bool | shift | ) |
Sets whether the significance of punctuation and whitespace characters should be shifted from the primary strength level to the quaternary strength level.
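As a rough illustration of what shifting means for a comparison made at a strength weaker than quaternary, the sketch below simply ignores ASCII punctuation and whitespace when `shift` is true. It is a simplified model under stated assumptions, not the RWUCollator algorithm: `dropShifted()` and `compare()` are hypothetical helpers, and the remaining characters are compared by code unit rather than by collation weight.

```cpp
#include <cctype>
#include <string>

// Removes ASCII punctuation and whitespace, mimicking characters whose
// significance has been shifted below the levels being compared.
std::string dropShifted(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (!std::ispunct(c) && !std::isspace(c)) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}

// Returns -1, 0, or 1. With shift == true, punctuation and whitespace are
// ignored; otherwise they take part in the comparison like any character.
int compare(const std::string& lhs, const std::string& rhs, bool shift) {
    const std::string a = shift ? dropShifted(lhs) : lhs;
    const std::string b = shift ? dropShifted(rhs) : rhs;
    const int r = a.compare(b);
    return r < 0 ? -1 : (r > 0 ? 1 : 0);
}
```

In this model, "black-bird", "black bird", and "blackbird" compare equal when shifting is on and differ when it is off.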
bool RWUCollator::equals | ( | const RWUString & | lhs, const RWUString & | rhs | ) | const |
Compares the given strings, according to the dictates of this collator's attributes. Returns true if lhs == rhs; otherwise, false.
RWUCollationKey RWUCollator::getCollationKey | ( | const RWUString & | str | ) | const |
Returns an RWUCollationKey corresponding to the given string str. This key may be compared to other keys produced by collators with the same attributes.
RWULocale RWUCollator::getLocale | ( | void | ) | const | inline |
Returns the locale associated with self.
CollationStrength RWUCollator::getStrength | ( | void | ) | const | inline |
Returns the CollationStrength associated with self.
bool RWUCollator::isEnabledCaseLevel | ( | void | ) | const |
Returns true if the case level is enabled; otherwise, false.
bool RWUCollator::isEnabledFrenchCollation | ( | void | ) | const |
Returns true if French collation rules are in effect; otherwise, false.
bool RWUCollator::isEnabledNormalizationChecking | ( | void | ) | const |
Returns true if normalization checking is enabled; otherwise, false.
bool RWUCollator::isEnabledPunctuationShifting | ( | void | ) | const |
Returns true if punctuation shifting is enabled; otherwise, false.
RWUCollator& RWUCollator::operator= | ( | const RWUCollator & | rhs | ) |
Assignment operator. Makes self a deep copy of rhs. Throws RWUException if any error occurs during the assignment.
void RWUCollator::setCaseOrder | ( | CaseOrder | order | ) |
Sets the case ordering for self to order.
void RWUCollator::setStrength | ( | CollationStrength | strength | ) | inline |
Sets the collation strength of self to strength.
Copyright © 2020 Rogue Wave Software, Inc. All Rights Reserved. |