SourcePro® API Reference Guide


Performs locale-sensitive string comparison for use in searching and sorting natural language text.

#include <rw/i18n/RWUCollator.h>

Public Types

enum  CaseOrder { Normal, LowerFirst, UpperFirst }
 
enum  CollationStrength {
  Primary, Secondary, Tertiary, Quaternary,
  Identical
}
 

Public Member Functions

 RWUCollator (const RWULocale &locale=RWULocale::getDefault())
 
 RWUCollator (const RWUCollator &original)
 
 ~RWUCollator (void)
 
int compareTo (const RWUString &lhs, const RWUString &rhs) const
 
void enableCaseLevel (bool caseLevel)
 
void enableFrenchCollation (bool frenchCollation)
 
void enableNormalizationChecking (bool check)
 
void enablePunctuationShifting (bool shift)
 
bool equals (const RWUString &lhs, const RWUString &rhs) const
 
CaseOrder getCaseOrder (void) const
 
RWUCollationKey getCollationKey (const RWUString &str) const
 
RWULocale getLocale (void) const
 
CollationStrength getStrength (void) const
 
bool isEnabledCaseLevel (void) const
 
bool isEnabledFrenchCollation (void) const
 
bool isEnabledNormalizationChecking (void) const
 
bool isEnabledPunctuationShifting (void) const
 
RWUCollator & operator= (const RWUCollator &rhs)
 
void setCaseOrder (CaseOrder order)
 
void setStrength (CollationStrength strength)
 

Detailed Description

RWUCollator performs locale-sensitive string comparison for use in searching and sorting natural language text.

Each language has its own rules for determining the proper collation order for strings. For example, in Lithuanian, the letter y appears between i and k in the alphabet. In order to take language-specific conventions into account, each RWUCollator is associated with an RWULocale at construction time. This locale specifies the default values for a variety of RWUCollator attributes. Many of these default values can be overridden using attribute mutator methods.

RWUCollator follows the Unicode Collation Algorithm, as described in Unicode Technical Standard #10:

http://www.unicode.org/reports/tr10/.

This collation algorithm can be customized using the attribute mutator methods of the RWUCollator class. With these methods, you can specify how collation elements are found, how collation weights are formed, and which collation levels should be considered significant. See the Internationalization Module User's Guide for more information on collation.

RWUCollator calculates collation weights incrementally. This ensures good performance, as most strings differ in their first few characters. However, if string comparisons are to be made repeatedly (for example, when sorting a set of strings), then best performance can be achieved by obtaining an RWUCollationKey for each string and comparing the keys. Generating a key via RWUCollator::getCollationKey() is a non-trivial operation, as it involves determining the collation elements and weights for an entire string. Comparing two RWUCollationKey objects, however, is fast.
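The trade-off described above can be sketched without the library. This is a minimal illustration in standard C++, not SourcePro code: `makeKey` is a hypothetical stand-in for an expensive key builder such as RWUCollator::getCollationKey() (here it only folds ASCII case), and `sortByKey` is an illustrative helper name. Each key is built once, and the sort then compares only the precomputed keys.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive key builder such as
// RWUCollator::getCollationKey(). This toy version only folds ASCII case.
std::string makeKey(const std::string& s)
{
    std::string key(s);
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return key;
}

// Sorts the strings by key, computing each key exactly once.
std::vector<std::string> sortByKey(const std::vector<std::string>& input)
{
    typedef std::pair<std::string, std::string> Keyed;  // (key, original)

    // Pair every string with its precomputed key.
    std::vector<Keyed> keyed;
    keyed.reserve(input.size());
    for (const std::string& s : input) {
        keyed.emplace_back(makeKey(s), s);
    }

    // Comparisons during the sort touch only the cheap keys.
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const Keyed& a, const Keyed& b) { return a.first < b.first; });

    std::vector<std::string> sorted;
    sorted.reserve(keyed.size());
    for (const Keyed& entry : keyed) {
        sorted.push_back(entry.second);
    }
    return sorted;
}
```

With RWUCollator, the keys would come from getCollationKey() rather than from the toy `makeKey` shown here.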

Example
#include <rw/i18n/RWUCollator.h>
#include <rw/i18n/RWUConversionContext.h>
#include <iostream>

using std::cout;
using std::endl;

int
main()
{
    // Indicate string literals are encoded according to
    // ISO-8859-1.
    RWUConversionContext context("ISO-8859-1");

    // Use implicit conversion to build two strings.
    RWUString string1("Blackbird");
    RWUString string2("black-bird");

    // Create a collator based on the "en" locale.
    RWUCollator collator("en");

    // Modify the collator so it ignores differences
    // in punctuation and case.
    collator.enablePunctuationShifting(true);
    collator.setStrength(RWUCollator::Secondary);

    // Compare the two strings.
    int retval = collator.compareTo(string1, string2);
    if (retval < 0) {
        cout << "string1 is less than string2" << endl;
    } else if (retval == 0) {
        cout << "string1 is equal to string2" << endl;
    } else {
        cout << "string1 is greater than string2" << endl;
    } // else

    return 0;
} // main

Program output:

string1 is equal to string2

See also
RWUCollationKey, RWUNormalizer

Member Enumeration Documentation

enum RWUCollator::CaseOrder

A CaseOrder value determines how characters are ordered at the tertiary level or, if enabled, the case level.

Enumerator
Normal 

characters are ordered in accordance with the Unicode Collation Charts. Typically, the lowercase version of a letter is ordered before all other versions.

LowerFirst 

lowercase letters, small kana, and uncased characters are ordered before mixed-case letters. Uppercase letters are ordered last.

UpperFirst 

uppercase letters are ordered before mixed-case letters. Lowercase letters, small kana, and uncased characters are ordered last.

enum RWUCollator::CollationStrength

A CollationStrength value indicates the level at which two collation elements should be considered equal.

Enumerator
Primary 

only primary differences are considered significant. Primary differences are locale-dependent, but are typically differences in basic character identity. An example of a primary difference is a != b.

Secondary 

both primary and secondary differences are considered significant. Secondary differences are locale-dependent, but are typically differences in diacritics. An example of a secondary difference is a != á.

Tertiary 

primary, secondary, and tertiary differences are considered significant. Tertiary differences are locale-dependent, but are typically differences in appearance, such as the differences between uppercase, lowercase, superscript, subscript, halfwidth, and circled versions of a character. An example of a tertiary difference is a != A.

Quaternary 

primary, secondary, tertiary, and quaternary differences are considered significant. Quaternary strength is useful only in two situations:

  • When punctuation shifting is enabled, whitespace and punctuation characters are ignored at the first three strength levels, and are distinguished at the quaternary level.
  • For Japanese locales, hiragana characters are positioned before katakana characters at the quaternary level, mimicking JIS sort order.

Identical 

all differences are considered significant. This strength level should be used sparingly. It rarely distinguishes between strings considered equal at the quaternary level, yet incurs a significant performance cost.
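The first three strength levels can be illustrated with a toy model in standard C++. This sketch is not the Unicode Collation Algorithm and not how RWUCollator is implemented: `ToyStrength`, `ToyWeights`, `toyWeights`, `weightAt`, and `toyCompare` are hypothetical names, and the weight table covers only a, á, b and their uppercase forms. It shows the idea that a whole level is compared before the next one is consulted, and that levels above the chosen strength are never consulted.

```cpp
#include <cstddef>
#include <string>

// Toy strength levels, named after the first three CollationStrength values.
enum ToyStrength { ToyPrimary = 1, ToySecondary = 2, ToyTertiary = 3 };

// Toy collation element: one weight per level.
struct ToyWeights {
    int primary;    // base letter
    int secondary;  // diacritic
    int tertiary;   // case
};

// Hypothetical weight table covering only a few characters.
ToyWeights toyWeights(char32_t c)
{
    switch (c) {
    case U'a':      return {1, 0, 0};
    case U'A':      return {1, 0, 1};
    case U'\u00E1': return {1, 1, 0};  // a with acute accent
    case U'\u00C1': return {1, 1, 1};  // A with acute accent
    case U'b':      return {2, 0, 0};
    case U'B':      return {2, 0, 1};
    default:        return {0, 0, 0};  // all other characters weigh nothing here
    }
}

// Returns the weight of w at the given level (1, 2, or 3).
int weightAt(const ToyWeights& w, int level)
{
    return level == 1 ? w.primary : (level == 2 ? w.secondary : w.tertiary);
}

// Three-way comparison returning -1, 0, or 1. Each level is compared across
// the whole string before the next level is looked at.
int toyCompare(const std::u32string& lhs, const std::u32string& rhs,
               ToyStrength strength)
{
    for (int level = 1; level <= strength; ++level) {
        std::size_t n = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
        for (std::size_t i = 0; i < n; ++i) {
            int l = weightAt(toyWeights(lhs[i]), level);
            int r = weightAt(toyWeights(rhs[i]), level);
            if (l != r) return l < r ? -1 : 1;
        }
        // A shorter string that is a prefix of the longer one sorts first.
        if (lhs.size() != rhs.size()) return lhs.size() < rhs.size() ? -1 : 1;
    }
    return 0;
}
```

In this model, a and á compare equal at `ToyPrimary` but differ at `ToySecondary`, and a and A differ only at `ToyTertiary`, matching the examples given for each enumerator above.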

Constructor & Destructor Documentation

RWUCollator::RWUCollator ( const RWULocale & locale = RWULocale::getDefault())

Constructs a new RWUCollator based on the given locale. Throws RWUException if any error occurs during construction.

RWUCollator::RWUCollator ( const RWUCollator & original)

Copy constructor. Makes self a deep copy of original. Throws RWUException if any error occurs during construction.

RWUCollator::~RWUCollator ( void  )
inline

Destructor.

Member Function Documentation

int RWUCollator::compareTo ( const RWUString & lhs,
const RWUString & rhs 
) const

Compares the given strings, according to the dictates of this collator's attributes. Returns -1 if lhs < rhs, 0 if lhs == rhs, and 1 if lhs > rhs.
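A three-way result like this one is commonly adapted into the "less than" predicate that std::sort expects. The sketch below uses only standard C++: `toyCompareTo` is a hypothetical stand-in that compares by code unit, not by collation rules, and `sortAscending` is an illustrative helper name.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a three-way comparison such as compareTo().
// Returns -1, 0, or 1 based on plain code-unit order.
int toyCompareTo(const std::string& lhs, const std::string& rhs)
{
    int c = lhs.compare(rhs);
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}

// Sorts in ascending order by mapping the three-way result to "less than".
void sortAscending(std::vector<std::string>& items)
{
    std::sort(items.begin(), items.end(),
              [](const std::string& a, const std::string& b) {
                  return toyCompareTo(a, b) < 0;
              });
}
```

Testing the result with `< 0` rather than `== -1`, as the class example does, keeps the predicate correct for any comparison function that returns an arbitrary negative value.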

void RWUCollator::enableCaseLevel ( bool  caseLevel)

Sets whether case distinctions should be made at an extra "case level," positioned between the secondary and tertiary levels:

  • If self's strength is Primary, base character identity is taken into consideration, then case distinctions are made. Diacritics are not taken into account.
  • If self's strength is Secondary, base character identity, diacritics, and case distinctions are taken into account, in that order. Other tertiary distinctions, such as those between regular and superscript versions of a character, are not taken into account.
  • If self's strength is Tertiary, base character identity, diacritics, case distinctions, and other tertiary distinctions are taken into account, in that order.

At the case level, cased characters are ordered according to self's CaseOrder attribute.

void RWUCollator::enableFrenchCollation ( bool  frenchCollation)

Sets whether French collation rules should be in effect for self.

When French collation rules are in effect, the diacritical differences at the secondary strength level are compared in reverse order, from the end of each string to its start.
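The effect of reversing the secondary comparison can be seen in a toy model. The sketch below is standard C++ and not RWUCollator code: `toyAccentWeight` and `compareAccents` are hypothetical names, only é and ô count as accented, and both inputs are assumed to be already equal at the primary level.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy secondary weight: 1 for the two accented letters used here, else 0.
int toyAccentWeight(char32_t c)
{
    return (c == U'\u00E9' || c == U'\u00F4') ? 1 : 0;  // e-acute, o-circumflex
}

// Compares only the accent (secondary) weights of two strings that are
// assumed to be equal at the primary level. When french is true, the
// weights are compared from the end of each string toward its start.
// Returns -1, 0, or 1.
int compareAccents(const std::u32string& lhs, const std::u32string& rhs,
                   bool french)
{
    std::vector<int> l, r;
    for (char32_t c : lhs) l.push_back(toyAccentWeight(c));
    for (char32_t c : rhs) r.push_back(toyAccentWeight(c));
    if (french) {
        std::reverse(l.begin(), l.end());
        std::reverse(r.begin(), r.end());
    }
    if (l < r) return -1;  // std::vector compares lexicographically
    if (r < l) return 1;
    return 0;
}
```

In this model, coté sorts before côte when the weights are compared front to back, and after it when they are compared back to front, because the last accent difference then decides the order.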

void RWUCollator::enableNormalizationChecking ( bool  check)

Sets whether self should perform normalization checks on input strings.

When normalization checking is disabled, self correctly compares strings that are in FCD (Fast C or D) form, that is, strings whose raw, recursive decomposition (without reordering of diacritics) results in a canonically ordered string. Most strings in many languages are in FCD form.

In contrast, normalization checking is enabled by default for languages that use multiple combining characters, such as Arabic, Hebrew, Hindi, Thai, and Vietnamese. This ensures that input strings are normalized if necessary before collation. If, however, you know your strings are already in FCD form, you can improve performance slightly by disabling normalization checking.

void RWUCollator::enablePunctuationShifting ( bool  shift)

Sets whether the significance of punctuation and whitespace characters should be shifted from the primary strength level to the quaternary strength level.
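A rough way to picture the shift is the toy model below, written in standard C++. It is not how RWUCollator works internally: `dropShifted` and `toyEquals` are hypothetical names, only ASCII punctuation and whitespace are handled, and the first three levels are collapsed into a plain string comparison.

```cpp
#include <cctype>
#include <string>

// Removes ASCII punctuation and whitespace, mimicking how shifted characters
// drop out of the first three levels in this toy model.
std::string dropShifted(const std::string& s)
{
    std::string out;
    for (unsigned char c : s) {
        if (!std::ispunct(c) && !std::isspace(c)) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}

// Returns true if the strings are equal in the toy model. With shift on,
// punctuation and whitespace matter only when quaternary is true.
bool toyEquals(const std::string& lhs, const std::string& rhs,
               bool shift, bool quaternary)
{
    if (!shift || quaternary) {
        return lhs == rhs;  // punctuation is significant
    }
    return dropShifted(lhs) == dropShifted(rhs);
}
```

Under this model, "black-bird" and "blackbird" are equal when shifting is on and the strength is below quaternary, and unequal otherwise.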

bool RWUCollator::equals ( const RWUString & lhs,
const RWUString & rhs 
) const

Compares the given strings, according to the dictates of this collator's attributes. Returns true if lhs == rhs; otherwise, false.

CaseOrder RWUCollator::getCaseOrder ( void  ) const

Returns the current CaseOrder for self.

RWUCollationKey RWUCollator::getCollationKey ( const RWUString & str) const

Returns an RWUCollationKey corresponding to the given string str. This key may be compared to other keys produced by collators with the same attributes.

RWULocale RWUCollator::getLocale ( void  ) const
inline

Returns the locale associated with self.

RWUCollator::CollationStrength RWUCollator::getStrength ( void  ) const
inline

Returns the CollationStrength associated with self.

bool RWUCollator::isEnabledCaseLevel ( void  ) const

Returns true if the case level is enabled; otherwise, false.

bool RWUCollator::isEnabledFrenchCollation ( void  ) const

Returns true if French collation rules are in effect; otherwise, false.

bool RWUCollator::isEnabledNormalizationChecking ( void  ) const

Returns true if normalization checking is enabled; otherwise, false.

bool RWUCollator::isEnabledPunctuationShifting ( void  ) const

Returns true if punctuation shifting is enabled; otherwise, false.

RWUCollator& RWUCollator::operator= ( const RWUCollator & rhs)

Assignment operator. Makes self a deep copy of rhs. Throws RWUException if any error occurs during the assignment.

void RWUCollator::setCaseOrder ( CaseOrder  order)

Sets the case ordering for self to order.

void RWUCollator::setStrength ( CollationStrength  strength)
inline

Sets the collation strength of self to strength.

Copyright © 2020 Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.