SourcePro® API Reference Guide

 
RWURegularExpression Class Reference

Represents a regular expression with Unicode extensions.

#include <rw/i18n/RWURegularExpression.h>

Public Types

enum  Options { Normal, IgnoreCase, InterpretGraphemes }
 
enum  Status {
  Ok, MissingEscapeSequence, InvalidHexNibble, InsufficientHex8Data,
  InsufficientHex16Data, MissingClosingBracket, MissingClosingCurlyBrace, MissingClosingParen,
  UnmatchedClosingParen, InvalidSubexpression, InvalidDataAfterOr, InvalidDataBeforeOr,
  ConsecutiveCardinalities, InvalidCardinalityRange, LeadingCardinality, InvalidDecimalDigit,
  UnmatchedClosingCurly, NeverEndingCategoryName, InvalidCategoryName, InfiniteEmptyMatch,
  ASCIIConversionError, InvalidGraphemeCluster, NumberOfStatusCodes
}
 
enum  UnicodeConformanceLevel { Basic, Tailored }
 

Public Member Functions

 RWURegularExpression ()
 
 RWURegularExpression (const RWURegularExpression &source)
 
 RWURegularExpression (const char *pattern, UnicodeConformanceLevel level=Basic, int32_t options=int32_t(Normal), const RWULocale &locale=RWULocale::getDefault(), RWUToUnicodeConverter &converter=RWUToUnicodeConversionContext::getContext().getConverter())
 
 RWURegularExpression (const RWCString &pattern, UnicodeConformanceLevel level=Basic, int32_t options=int32_t(Normal), const RWULocale &locale=RWULocale::getDefault(), RWUToUnicodeConverter &converter=RWUToUnicodeConversionContext::getContext().getConverter())
 
 RWURegularExpression (const RWUString &pattern, UnicodeConformanceLevel level=Basic, int32_t options=int32_t(Normal), const RWULocale &locale=RWULocale::getDefault())
 
 ~RWURegularExpression ()
 
RWUCollator::CollationStrength getCollationStrength () const
 
UnicodeConformanceLevel getLevel () const
 
RWULocale getLocale () const
 
int32_t getOptions () const
 
RWUString getPattern () const
 
RWURegexResult matchAt (const RWUString &str) const
 
RWURegexResult matchAt (const RWUString &str, const RWUConstStringIterator &start) const
 
RWURegexResult matchAt (const RWUString &str, const RWUConstStringIterator &start, const RWUConstStringIterator &end) const
 
bool operator< (const RWURegularExpression &rhs)
 
RWURegularExpression & operator= (const RWURegularExpression &rhs)
 
bool operator== (const RWURegularExpression &rhs)
 
size_t replace (RWUString &str, const RWUString &replacement, size_t count=size_t(1), int32_t matchID=0) const
 
size_t replace (RWUString &str, const RWUString &replacement, size_t count, int32_t matchID, const RWUConstStringIterator &start) const
 
size_t replace (RWUString &str, const RWUString &replacement, size_t count, int32_t matchID, const RWUConstStringIterator &start, const RWUConstStringIterator &end, bool replaceEmptyMatches=true) const
 
RWURegexResult search (const RWUString &str) const
 
RWURegexResult search (const RWUString &str, const RWUConstStringIterator &start) const
 
RWURegexResult search (const RWUString &str, const RWUConstStringIterator &start, const RWUConstStringIterator &end) const
 
void setCollationStrength (RWUCollator::CollationStrength)
 
void setLevel (UnicodeConformanceLevel level=Basic)
 
void setLocale (const RWULocale &loc)
 
size_t subCount () const
 

Detailed Description

RWURegularExpression supports regular expressions with Unicode extensions.

A regular expression is a string pattern composed of normal characters and special characters. Special characters are used to denote an arrangement of the other characters in the regular expression pattern. A regular expression can be used to search for, and perhaps replace, occurrences of the regular expression pattern in strings.

Regular expression syntax describes how to arrange normal characters and special characters to form a valid regular expression pattern. The regular expression syntax for RWURegularExpression is similar to that of the POSIX 2 extended regular expression (ERE) specification. For more information see the Internationalization Module User's Guide.

RWURegularExpression extends the POSIX 2 ERE syntax to provide support for Unicode basic and tailored regular expressions.

Basic Unicode regular expression support corresponds to Level 1 support, as described in the Unicode Regular Expression Guidelines (Unicode Technical Report #18 (UTR-18) Version 5.1). Basic Unicode regular expressions are useful for the majority of Unicode strings, and extend the POSIX ERE standard with Unicode extensions.

Tailored Unicode regular expressions extend the basic regular expression functionality, corresponding to Level 2 and Level 3 support, also described in UTR-18 Version 5.1. In addition to some minor additions, tailored extensions include full support for surrogates, and locale-based handling of graphemes and string collation.

For more information on basic and tailored regular expression support in the Internationalization Module, see the Internationalization Module User's Guide.

The Role of the Locale in a Regular Expression

RWURegularExpression accepts an RWULocale argument in its constructor or through the setLocale() method. The regular expression instance uses the locale to determine locale-specific behavior in a tailored regular expression; locales have little effect on basic regular expressions. Grapheme clusters, character sets, and the break locations for words, sentences, and lines may change depending on locale. For example, the Spanish character 'ch' is found in the character set "[b-d]" in Spanish locales, but not in English locales.

For more information on creating regular expressions, see the Internationalization Module User's Guide.

Example

#include <rw/i18n/RWURegularExpression.h>
#include <rw/i18n/RWUConversionContext.h>
#include <rw/i18n/RWUString.h>
#include <iostream>

using std::cout;
using std::endl;

int main()
{
    // Indicate string literals are encoded as US-ASCII strings.
    RWUConversionContext context("US-ASCII");

    // Create a string in which to search.
    RWUString text("The quick brown fox.");

    // Create a regular expression to search for "own" as a
    // distinct word. The character category [{WB}] will be
    // interpreted in terms of the default locale. Use
    // RWURegularExpression::setLocale() to interpret breaks
    // in terms of a different locale.
    RWURegularExpression regexp("[{WB}]own[{WB}]");

    // This search should fail because "own" appears only
    // within the word "brown" and not as a distinct word.
    RWURegexResult result = regexp.search(text);
    if (result) {
        cout << "Overall match at offset " << int32_t(result.begin(text))
             << " with length " << result.getLength() << "." << endl;
    } else {
        cout << "No match" << endl;
    }

    // Create a regular expression to search for "quick" as
    // a distinct word.
    regexp = RWURegularExpression("[{WB}]quick[{WB}]");

    // This search should succeed.
    result = regexp.search(text);
    if (result) {
        cout << "Overall match at offset " << int32_t(result.begin(text))
             << " with length " << result.getLength() << "." << endl;
    } else {
        cout << "No match" << endl;
    }

    return 0;
} // main

Program output:

No match
Overall match at offset 4 with length 5.
See also
RWUStringSearch

Member Enumeration Documentation

enum RWURegularExpression::Options

Lists options for changing the behavior of RWURegularExpression pattern matching.

Enumerator
Normal 

Specifies normal pattern matching operations, with no special options enabled.

IgnoreCase 

Indicates that characters in the pattern string and search string should be compared without regard to case.

InterpretGraphemes 

This option is valid only with Tailored regular expressions. It causes the pattern compiler to treat a grapheme, such as "a\u0308", as a single unit, which changes, among other things, how cardinalities are applied. With this option, "a\u0308*" matches zero or more occurrences of anything equivalent to "a\u0308"; without it, the pattern matches an 'a' followed by zero or more occurrences of "\u0308".

Further, this option changes the behavior of '.'. With this option, '.' matches any logical character including graphemes (except those outlined above). Without the option, '.' matches any code point except for one which indicates the end of a logical line. (For a list of specific characters excepted, see the Internationalization Module User's Guide.)

enum RWURegularExpression::Status

Lists regular expression pattern error codes that could be reported during regular expression pattern compilation. These error codes are reported through an exception of type RWRegexErr.

Enumerator
Ok 

Indicates that the pattern has been successfully compiled.

MissingEscapeSequence 

Indicates a missing escape sequence, as in "ab\".

InvalidHexNibble 

Indicates an invalid hexadecimal escape sequence, as in "ab\u00fg".

InsufficientHex8Data 

Indicates an insufficient number of hex nibbles in an 8-bit hexadecimal escape sequence, as in "ab\x0".

InsufficientHex16Data 

Indicates an insufficient number of hex nibbles in a 16-bit hexadecimal escape sequence, as in "ab\u00f".

MissingClosingBracket 

Indicates a missing closing bracket on a bracket expression, as in "ab[cd".

MissingClosingCurlyBrace 

Indicates a missing closing curly brace in a cardinality specification, as in "(abc){2,3".

MissingClosingParen 

Indicates a missing closing parenthesis in a sub-expression definition, as in "ab(c(d)ef".

UnmatchedClosingParen 

Indicates that a closing parenthesis was found, for which there is no opening parenthesis, as in "ab(cd)e)f".

InvalidSubexpression 

Indicates that an invalid sub-expression specification has been encountered, such as "ab(*cd)".

InvalidDataAfterOr 

Indicates that the character following an alternation symbol, "|", was considered invalid, as in "ab|*cd", or "ab||cd".

InvalidDataBeforeOr 

Indicates that the data preceding an alternation symbol, "|", was considered invalid, as in "|", "|bc", and "ab(|cd)".

ConsecutiveCardinalities 

Indicates that consecutive cardinality specifiers were found in the pattern, as in "a*+" or "ab{2,3}*".

InvalidCardinalityRange 

Indicates that an invalid cardinality range was specified, as in "ab{,}", and "a{}".

LeadingCardinality 

Specifies that a leading cardinality specifier was encountered, as in "*a".

InvalidDecimalDigit 

Specifies that an invalid decimal digit was encountered in a pattern string, as in "ab{3,a}".

UnmatchedClosingCurly 

Indicates that a closing curly brace was encountered for which there was no matching opening curly brace, as in "ab2,3}".

NeverEndingCategoryName 

Indicates that a category name was started, but that no closing curly brace was found to end the category name, as in "[{L]+123".

InvalidCategoryName 

Indicates that an unrecognized category name was specified in a bracket expression, as in "[{Smile}]".

InfiniteEmptyMatch 

Indicates that a category that could produce a zero-length match was found with infinite cardinality. Such categories include: Word Break "WB", Character Break "CB", Line Break "LB", Sentence Break "SB", Beginning of Line "BOL", and End of Line "EOL". As such, the following are invalid: "[{WB}]*", or "ab([{WB}])*cd".

ASCIIConversionError 

Indicates that a problem was encountered while converting a US-ASCII pattern string to UTF-16. This can occur only when using the RWCString conversion constructor.

InvalidGraphemeCluster 

Indicates that an invalid grapheme cluster specification was found. This implies that the grapheme cluster did not follow the syntax, "\g{...}", where "..." is any sequence of code units. For example, "\gab}" is invalid because of a missing opening curly brace.

NumberOfStatusCodes 

Indicates the number of status codes potentially reported during the compilation of regular expression patterns.

enum RWURegularExpression::UnicodeConformanceLevel

Describes the levels of Unicode Regular Expression support available through RWURegularExpression. Two levels are available: Basic (Level 1), and Tailored (Levels 2 and 3). Both are described in Version 5.1 of Unicode Technical Report #18 and the Internationalization Module User's Guide.

Enumerator
Basic 

Specifies Basic Unicode regular expression support.

Tailored 

Specifies Tailored Unicode regular expression support, which adds full support for surrogates, and locale-based handling of graphemes and string collation.

Constructor & Destructor Documentation

RWURegularExpression::RWURegularExpression ( )

Default constructor. Creates an empty regular expression pattern object that does not match any input string.

RWURegularExpression::RWURegularExpression ( const RWURegularExpression & source)

Copy constructor. Creates a copy of the source RWURegularExpression object.

Exceptions
std::bad_alloc  Thrown if memory resources are exhausted during pattern compilation.

RWURegularExpression::RWURegularExpression ( const char * pattern,
UnicodeConformanceLevel level = Basic,
int32_t options = int32_t(Normal),
const RWULocale & locale = RWULocale::getDefault(),
RWUToUnicodeConverter & converter = RWUToUnicodeConversionContext::getContext().getConverter()
)
explicit

Constructs an RWURegularExpression from the null-terminated char* pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any escape sequences are handled as for RWUString::unescape().

  • The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
  • The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Exceptions
std::bad_alloc  Thrown if memory resources are exhausted during pattern compilation.
RWRegexErr  Thrown to report pattern compilation errors.

RWURegularExpression::RWURegularExpression ( const RWCString & pattern,
UnicodeConformanceLevel level = Basic,
int32_t options = int32_t(Normal),
const RWULocale & locale = RWULocale::getDefault(),
RWUToUnicodeConverter & converter = RWUToUnicodeConversionContext::getContext().getConverter()
)
explicit

Constructs an RWURegularExpression from the RWCString pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any escape sequences are handled as for RWUString::unescape().

  • The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
  • The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Exceptions
std::bad_alloc  Thrown if memory resources are exhausted during pattern compilation.
RWRegexErr  Thrown to report pattern compilation errors.

RWURegularExpression::RWURegularExpression ( const RWUString & pattern,
UnicodeConformanceLevel level = Basic,
int32_t options = int32_t(Normal),
const RWULocale & locale = RWULocale::getDefault()
)
explicit

Constructs an RWURegularExpression from the RWUString pattern.

  • The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
  • The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Exceptions
std::bad_alloc  Thrown if memory resources are exhausted during pattern compilation.
RWRegexErr  Thrown to report pattern compilation errors.

RWURegularExpression::~RWURegularExpression ( )

Destructor.

Member Function Documentation

RWUCollator::CollationStrength RWURegularExpression::getCollationStrength ( ) const

Returns the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions.

Exceptions
RWUException  Thrown if invoked on a basic regular expression.

UnicodeConformanceLevel RWURegularExpression::getLevel ( ) const

Returns the current level of Unicode regular expression support associated with self.

RWULocale RWURegularExpression::getLocale ( ) const

Returns a copy of the locale used by self.

int32_t RWURegularExpression::getOptions ( ) const

Returns the pattern matching Options associated with self as an int32_t bit-mask.

RWUString RWURegularExpression::getPattern ( ) const

Returns the RWUString pattern string currently associated with self.

RWURegexResult RWURegularExpression::matchAt ( const RWUString & str) const
inline

Tests for a match for this regular expression at the first character position in input string str. Does not find matches that begin after this position.

RWURegexResult RWURegularExpression::matchAt ( const RWUString & str,
const RWUConstStringIterator & start
) const
inline

Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches that begin at any other position.

RWURegexResult RWURegularExpression::matchAt ( const RWUString & str,
const RWUConstStringIterator & start,
const RWUConstStringIterator & end
) const

Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches that begin anywhere other than the start position, or that end after the end position.

bool RWURegularExpression::operator< ( const RWURegularExpression & rhs)

Compares two regular expression objects. The comparison is performed using RWUString::operator<() to compare the pattern strings stored in each regular expression. Returns true if self's pattern is less than the rhs pattern; otherwise, false.

RWURegularExpression& RWURegularExpression::operator= ( const RWURegularExpression & rhs)

Assigns the rhs regular expression object to self.

bool RWURegularExpression::operator== ( const RWURegularExpression & rhs)

Compares two regular expression objects. The comparison is performed using RWUString::operator==() to compare the pattern strings stored in each regular expression. Returns true if self's pattern is equal to the rhs pattern; otherwise, false.

size_t RWURegularExpression::replace ( RWUString & str,
const RWUString & replacement,
size_t count = size_t(1),
int32_t matchID = 0
) const
inline

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. The default count is 1. Specifying a count of 0 replaces all occurrences of the pattern. Returns the number of replacements. Empty (zero-length) matches are replaced.

size_t RWURegularExpression::replace ( RWUString & str,
const RWUString & replacement,
size_t count,
int32_t matchID,
const RWUConstStringIterator & start
) const
inline

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at the specified start position. Returns the number of replacements. Empty (zero-length) matches are replaced.

size_t RWURegularExpression::replace ( RWUString & str,
const RWUString & replacement,
size_t count,
int32_t matchID,
const RWUConstStringIterator & start,
const RWUConstStringIterator & end,
bool replaceEmptyMatches = true
) const

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at a specified start position. No match that extends beyond the specified end position is replaced. The method also allows you to specify whether or not empty (zero-length) matches should be replaced; the default is true.

RWURegexResult RWURegularExpression::search ( const RWUString & str) const
inline

Searches input string str for substrings that match this regular expression. The search begins at the beginning of the string, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.

RWURegexResult RWURegularExpression::search ( const RWUString & str,
const RWUConstStringIterator & start
) const
inline

Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.

RWURegexResult RWURegularExpression::search ( const RWUString & str,
const RWUConstStringIterator & start,
const RWUConstStringIterator & end
) const

Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the specified end position is reached, or a match is found. No match that extends beyond the specified end position is found. Returns an instance of RWURegexResult to report the result of the operation.

void RWURegularExpression::setCollationStrength ( RWUCollator::CollationStrength  )

Sets the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions.

Exceptions
RWUException  Thrown if this method is invoked on a basic regular expression.

void RWURegularExpression::setLevel ( UnicodeConformanceLevel level = Basic)

Sets the Unicode conformance level for self to the specified level. The default is Basic.

Note
The regular expression pattern will be recompiled into a form that more efficiently allows for the specified level of Unicode support.

void RWURegularExpression::setLocale ( const RWULocale & loc)

Imbues a locale on the regular expression object. The locale is used internally in the detection of breaks in the text.

size_t RWURegularExpression::subCount ( ) const

Returns the count of parenthesized subexpressions contained in the regular expression pattern associated with self. For example, in the pattern a(b(c)d)e, there are two parenthesized subexpressions.

Copyright © 2022 Rogue Wave Software, Inc., a Perforce company. All Rights Reserved.