Represents a regular expression with Unicode extensions.
#include <rw/i18n/RWURegularExpression.h>
RWURegularExpression supports regular expressions with Unicode extensions.
A regular expression is a string pattern composed of normal characters and special characters. Special characters are used to denote an arrangement of the other characters in the regular expression pattern. A regular expression can be used to search for, and perhaps replace, occurrences of the regular expression pattern in strings.
Regular expression syntax describes how to arrange normal characters and special characters to form a valid regular expression pattern. The regular expression syntax for RWURegularExpression is similar to that of the POSIX 2 extended regular expression (ERE) specification. For more information see the Internationalization Module User's Guide.
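As a rough illustration of extended regular expression syntax, the sketch below uses the C++ standard library's std::regex with the std::regex::extended grammar as a stand-in. It is not RWURegularExpression and provides none of the Unicode extensions described here; the helper name findFirst is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` that matches the
// POSIX ERE `pattern`, or an empty string if nothing matches.
// std::regex stands in for RWURegularExpression (an assumption made purely
// for illustration; it has no Unicode extensions).
std::string findFirst(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern, std::regex::extended);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

Here `[0-9]+` combines a bracket expression (special characters) with a cardinality specifier, while a pattern such as `colou?r` mixes normal characters with the optional-match operator.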
RWURegularExpression extends the POSIX 2 ERE syntax to provide support for Unicode basic and tailored regular expressions.
Basic Unicode regular expression support corresponds to Level 1 support, as described in the Unicode Regular Expression Guidelines (Unicode Technical Report #18 (UTR-18), Version 5.1). Basic Unicode regular expressions are useful for the majority of Unicode strings, and extend the POSIX ERE standard with the following Unicode extensions:
Tailored Unicode regular expressions extend the basic regular expression functionality, corresponding to Level 2 and Level 3 support, also described in UTR-18 Version 5.1. In addition to some minor additions, tailored extensions include support for:
For more information on basic and tailored regular expression support in the Internationalization Module, see the Internationalization Module User's Guide.
The Role of the Locale in a Regular Expression
RWURegularExpression accepts an RWULocale argument in its constructor, or via the setLocale() method. The regular expression instance uses the locale to determine locale-specific behavior in a tailored regular expression (locales have little effect on basic regular expressions). Grapheme clusters, character sets, and the break locations for words, sentences, and lines may change depending on locale. For example, the Spanish character 'ch' is found in the character set "[b-d]" in Spanish locales, but not in English.
For more information on creating regular expressions, see the Internationalization Module User's Guide.
Lists options for changing the behavior of RWURegularExpression pattern matching.
Enumerator | Description |
---|---|
Normal | Specifies normal pattern matching operations, with no special options enabled. |
IgnoreCase | Indicates that characters in the pattern string and search string should be compared without regard to case. |
InterpretGraphemes | This option is valid only with Tailored regular expressions. This option causes the pattern compiler to recognize graphemes, such as Further, this option changes the behavior of |
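The effect of a case-insensitive option can be sketched with the standard library, where std::regex::icase is OR-ed into the syntax flags in much the same spirit as combining Options values into a bit-mask. This uses std::regex as a stand-in, not RWURegularExpression, and the helper name contains is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches somewhere in `text`.
// When `ignoreCase` is set, the icase flag is OR-ed into the syntax flags,
// analogous to adding IgnoreCase to a bit-mask of options.
// std::regex is a stand-in here (assumption for illustration only).
bool contains(const std::string& text, const std::string& pattern, bool ignoreCase) {
    std::regex::flag_type flags = std::regex::extended;
    if (ignoreCase) {
        flags |= std::regex::icase;
    }
    const std::regex re(pattern, flags);
    return std::regex_search(text, re);
}
```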
Lists regular expression pattern error codes that could be reported during regular expression pattern compilation. These error codes are reported through an exception of type RWRegexErr.
Enumerator | Description |
---|---|
Ok | Indicates that the pattern has been successfully compiled. |
MissingEscapeSequence | Indicates a missing escape sequence, as in |
InvalidHexNibble | Indicates an invalid hexadecimal escape sequence, as in |
InsufficientHex8Data | Indicates an insufficient number of hex nibbles in an 8-bit hexadecimal escape sequence, as in |
InsufficientHex16Data | Indicates an insufficient number of hex nibbles in a 16-bit hexadecimal escape sequence, as in |
MissingClosingBracket | Indicates a missing closing bracket on a bracket expression, as in |
MissingClosingCurlyBrace | Indicates a missing closing curly brace in a cardinality specification, as in |
MissingClosingParen | Indicates a missing closing parenthesis in a sub-expression definition, as in |
UnmatchedClosingParen | Indicates that a closing parenthesis was found, for which there is no opening parenthesis, as in |
InvalidSubexpression | Indicates that an invalid sub-expression specification has been encountered, such as |
InvalidDataAfterOr | Indicates that the character following an alternation symbol, |
InvalidDataBeforeOr | Indicates that the data preceding an alternation symbol, |
ConsecutiveCardinalities | Indicates that consecutive cardinality specifiers were found in the pattern, as in |
InvalidCardinalityRange | Indicates that an invalid cardinality range was specified, as in |
LeadingCardinality | Indicates that a leading cardinality specifier was encountered, as in |
InvalidDecimalDigit | Indicates that an invalid decimal digit was encountered in a pattern string, as in |
UnmatchedClosingCurly | Indicates that a closing curly brace was encountered for which there was no matching opening curly brace, as in |
NeverEndingCategoryName | Indicates that a category name was started, but that no closing curly brace was found to end the category name, as in |
InvalidCategoryName | Indicates that an unrecognized category name was specified in a bracket expression, as in |
InfiniteEmptyMatch | Indicates that a category that could produce a zero-length match was found with infinite cardinality. Such categories include: Word Break |
ASCIIConversionError | Indicates that a problem was encountered while converting a US-ASCII pattern string to UTF-16. This can occur only when using the RWCString conversion constructor. |
InvalidGraphemeCluster | Indicates that an invalid grapheme cluster specification was found. This implies that the grapheme cluster did not follow the syntax, |
NumberOfStatusCodes | Indicates the number of status codes potentially reported during the compilation of regular expression patterns. |
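Reporting compilation errors through an exception can be sketched as follows. The example uses std::regex and std::regex_error as an analogy for RWURegularExpression and RWRegexErr; the types and error codes are not the same, and the helper name compiles is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: tries to compile `pattern` as a POSIX ERE and reports
// whether compilation succeeded. On failure, `message` receives the text of
// the exception. std::regex_error plays the role that RWRegexErr plays for
// RWURegularExpression (an analogy only, not the same type or codes).
bool compiles(const std::string& pattern, std::string& message) {
    try {
        const std::regex re(pattern, std::regex::extended);
        message.clear();
        return true;
    } catch (const std::regex_error& e) {
        message = e.what();
        return false;
    }
}
```

Patterns with a missing closing parenthesis or a missing closing bracket are typical inputs that fail to compile.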
Describes the levels of Unicode Regular Expression support available through RWURegularExpression. Two levels are available: Basic (Level 1), and Tailored (Levels 2 and 3). Both are described in Version 5.1 of Unicode Technical Report #18 and the Internationalization Module User's Guide.
RWURegularExpression::RWURegularExpression()
Default constructor. Creates an empty regular expression pattern object that does not match any input string.
RWURegularExpression::RWURegularExpression(const RWURegularExpression& source)
Copy constructor. Creates a copy of the source RWURegularExpression object.
std::bad_alloc | Thrown if memory resources are exhausted during pattern compilation. |
|
explicit |
Constructs an RWURegularExpression from the null-terminated char* pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any escape sequences are handled as for RWUString::unescape().
Options are passed as an int32_t bit-mask of Options values that specifies special options for pattern matching. The default value is Normal, indicating no special matching options are used.
std::bad_alloc | Thrown if memory resources are exhausted during pattern compilation. |
RWRegexErr | Thrown to report pattern compilation errors. |
|
explicit |
Constructs an RWURegularExpression from the RWCString pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any escape sequences are handled as for RWUString::unescape().
Options are passed as an int32_t bit-mask of Options values that specifies special options for pattern matching. The default value is Normal, indicating no special matching options are used.
std::bad_alloc | Thrown if memory resources are exhausted during pattern compilation. |
RWRegexErr | Thrown to report pattern compilation errors. |
|
explicit |
Constructs an RWURegularExpression from the RWUString pattern.
Options are passed as an int32_t bit-mask of Options values that specifies special options for pattern matching. The default value is Normal, indicating no special matching options are used.
std::bad_alloc | Thrown if memory resources are exhausted during pattern compilation. |
RWRegexErr | Thrown to report pattern compilation errors. |
RWURegularExpression::~RWURegularExpression()
Destructor.
RWUCollator::CollationStrength RWURegularExpression::getCollationStrength() const
Returns the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions.
RWUException | Thrown if invoked on a basic regular expression. |
UnicodeConformanceLevel RWURegularExpression::getLevel() const
Returns the current level of Unicode regular expression support associated with self.
RWULocale RWURegularExpression::getLocale() const
Returns a copy of the locale used by self.
int32_t RWURegularExpression::getOptions() const
Returns the pattern matching Options associated with self as an int32_t bit-mask.
RWUString RWURegularExpression::getPattern() const
Returns the RWUString pattern string currently associated with self.
|
inline |
Tests for a match for this regular expression at the first character position in input string str. Does not find matches that begin after this position.
|
inline |
Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches that begin other than at this position.
RWURegexResult RWURegularExpression::matchAt(const RWUString& str, const RWUConstStringIterator& start, const RWUConstStringIterator& end) const
Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches that begin anywhere other than the start position, or that extend beyond the end position.
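The difference between an anchored match and a search can be approximated with the standard library: passing std::regex_constants::match_continuous to std::regex_search accepts only matches that begin at the first position of the range. This is a sketch with std::regex standing in for RWURegularExpression; the helper name matchLengthAt and its return convention are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the length of the match that begins exactly at
// offset `start` in `text`, or -1 if no match begins there.
// match_continuous restricts the search to matches starting at the first
// position of the range, approximating an anchored "match at" operation.
// std::regex is a stand-in (assumption for illustration only).
long matchLengthAt(const std::string& text, const std::regex& re, std::size_t start) {
    if (start > text.size()) {
        return -1;
    }
    const auto first = text.cbegin() + static_cast<std::string::difference_type>(start);
    std::smatch m;
    const bool found = std::regex_search(first, text.cend(), m, re,
                                         std::regex_constants::match_continuous);
    return found ? static_cast<long>(m.length(0)) : -1;
}
```

With the pattern `b+` and the input `abbbc`, nothing matches at offset 0 even though a match exists later in the string, while offset 1 yields a match.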
bool RWURegularExpression::operator<(const RWURegularExpression& rhs)
Compares two regular expression objects. The comparison is performed using RWUString::operator<() to compare the pattern strings stored in each regular expression. Returns true if self's pattern is less than the rhs pattern; otherwise, false.
RWURegularExpression& RWURegularExpression::operator=(const RWURegularExpression& rhs)
Assigns the rhs regular expression object to self.
bool RWURegularExpression::operator==(const RWURegularExpression& rhs)
Compares two regular expression objects. The comparison is performed using RWUString::operator==() to compare the pattern strings stored in each regular expression. Returns true if self's pattern is equal to the rhs pattern; otherwise, false.
|
inline |
Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. The default count is 1. Specifying a count of 0 replaces all occurrences of the pattern. Returns the number of replacements. Empty (zero-length) matches are replaced.
|
inline |
Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. The default count is 1. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at the specified start position. Returns the number of replacements. Empty (zero-length) matches are replaced.
size_t RWURegularExpression::replace(RWUString& str, const RWUString& replacement, size_t count, int32_t matchID, const RWUConstStringIterator& start, const RWUConstStringIterator& end, bool replaceEmptyMatches = true) const
Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at a specified start position. No match that extends beyond the specified end position is replaced. The method also allows you to specify whether or not empty (zero-length) matches should be replaced; the default is true.
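The count semantics described for replace() (a default of 1, with 0 meaning "all occurrences") can be sketched with the standard library. This is an illustrative re-implementation on top of std::regex, not the library's own code: the helper name replaceUpTo is hypothetical, the replacement is inserted literally, and the start, end, and matchID parameters are not modeled.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: replaces up to `count` matches of `re` in `str` with
// the literal text `replacement` and returns the number of replacements.
// A count of 0 means "replace every match". Empty matches are replaced too.
// std::regex is a stand-in (assumption for illustration only).
std::size_t replaceUpTo(std::string& str, const std::regex& re,
                        const std::string& replacement, std::size_t count = 1) {
    std::string out;
    std::size_t done = 0;
    auto last = str.cbegin();  // end of the previously copied region
    for (std::sregex_iterator it(str.cbegin(), str.cend(), re), end; it != end; ++it) {
        if (count != 0 && done == count) {
            break;  // reached the requested number of replacements
        }
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // text between matches is kept as-is
        out += replacement;
        last = m[0].second;
        ++done;
    }
    out.append(last, str.cend());  // remainder after the last replaced match
    str = std::move(out);
    return done;
}
```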
|
inline |
Searches input string str for substrings that match this regular expression. The search begins at the beginning of the string, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.
|
inline |
Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.
RWURegexResult RWURegularExpression::search(const RWUString& str, const RWUConstStringIterator& start, const RWUConstStringIterator& end) const
Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the specified end position is reached, or a match is found. No match that extends beyond the specified end position is found. Returns an instance of RWURegexResult to report the result of the operation.
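Searching within a bounded range can be sketched with std::regex_search over an iterator sub-range, so that no match extending past the end position is reported. As before, std::regex stands in for RWURegularExpression; the helper name searchIn and its offset-based interface are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the offset in `text` of the first match that
// lies entirely within [start, end), or std::string::npos if there is none.
// std::regex is a stand-in (assumption for illustration only).
std::size_t searchIn(const std::string& text, const std::regex& re,
                     std::size_t start, std::size_t end) {
    if (start > end || end > text.size()) {
        return std::string::npos;
    }
    using Diff = std::string::difference_type;
    const auto first = text.cbegin() + static_cast<Diff>(start);
    const auto last = text.cbegin() + static_cast<Diff>(end);
    std::smatch m;
    if (!std::regex_search(first, last, m, re)) {
        return std::string::npos;
    }
    // position(0) is relative to `first`, so add `start` back.
    return start + static_cast<std::size_t>(m.position(0));
}
```

In the string `cat dog cat`, starting the search at offset 1 skips the first occurrence, and moving the end position one character before the end of the string excludes the second occurrence as well.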
void RWURegularExpression::setCollationStrength(RWUCollator::CollationStrength)
Sets the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions.
RWUException | Thrown if this method is invoked on a basic regular expression. |
void RWURegularExpression::setLevel(UnicodeConformanceLevel level = Basic)
Sets the Unicode conformance level for self to the specified level. The default is Basic.
void RWURegularExpression::setLocale(const RWULocale& loc)
Imbues the regular expression object with a locale. The locale is used internally in the detection of breaks in the text.
size_t RWURegularExpression::subCount() const
Returns the count of parenthesized subexpressions contained in the regular expression pattern associated with self. For example, in the pattern a(b(c)d)e, there are two parenthesized subexpressions.
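The same counting rule can be checked with the standard library, where std::regex::mark_count() reports the number of marked (parenthesized) subexpressions. This uses std::regex as a stand-in for RWURegularExpression; the helper name countSubexpressions is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: counts the parenthesized subexpressions in a POSIX ERE
// pattern by compiling it with std::regex and querying mark_count().
// std::regex is a stand-in (assumption for illustration only).
std::size_t countSubexpressions(const std::string& pattern) {
    const std::regex re(pattern, std::regex::extended);
    return re.mark_count();
}
```

Nested groups are counted individually, so the outer `(b(c)d)` and the inner `(c)` each contribute one.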
Copyright © 2020 Rogue Wave Software, Inc. All Rights Reserved.