Represents a regular expression with Unicode extensions.

#include <rw/i18n/RWURegularExpression.h>

Public Types
enum	Options { Normal , IgnoreCase , InterpretGraphemes }

enum	Status { Ok , MissingEscapeSequence , InvalidHexNibble , InsufficientHex8Data , InsufficientHex16Data , MissingClosingBracket , MissingClosingCurlyBrace , MissingClosingParen , UnmatchedClosingParen , InvalidSubexpression , InvalidDataAfterOr , InvalidDataBeforeOr , ConsecutiveCardinalities , InvalidCardinalityRange , LeadingCardinality , InvalidDecimalDigit , UnmatchedClosingCurly , NeverEndingCategoryName , InvalidCategoryName , InfiniteEmptyMatch , ASCIIConversionError , InvalidGraphemeCluster , NumberOfStatusCodes }

enum	UnicodeConformanceLevel { Basic , Tailored }

Public Member Functions
	RWURegularExpression ()

	RWURegularExpression (const char *pattern, UnicodeConformanceLevel level=Basic, int32_t options=int32_t(Normal), const RWULocale &locale=RWULocale::getDefault(), RWUToUnicodeConverter &converter=RWUToUnicodeConversionContext::getContext().getConverter())

	RWURegularExpression (const RWCString &pattern, UnicodeConformanceLevel level=Basic, int32_t options=int32_t(Normal), const RWULocale &locale=RWULocale::getDefault(), RWUToUnicodeConverter &converter=RWUToUnicodeConversionContext::getContext().getConverter())

	RWURegularExpression (const RWURegularExpression &source)

	RWURegularExpression (const RWUString &pattern, UnicodeConformanceLevel level=Basic, int32_t options=int32_t(Normal), const RWULocale &locale=RWULocale::getDefault())

	~RWURegularExpression ()

RWUCollator::CollationStrength	getCollationStrength () const

UnicodeConformanceLevel	getLevel () const

RWULocale	getLocale () const

int32_t	getOptions () const

RWUString	getPattern () const

RWURegexResult	matchAt (const RWUString &str) const

RWURegexResult	matchAt (const RWUString &str, const RWUConstStringIterator &start) const

RWURegexResult	matchAt (const RWUString &str, const RWUConstStringIterator &start, const RWUConstStringIterator &end) const

bool	operator< (const RWURegularExpression &rhs)

RWURegularExpression &	operator= (const RWURegularExpression &rhs)

bool	operator== (const RWURegularExpression &rhs) const

size_t	replace (RWUString &str, const RWUString &replacement, size_t count, int32_t matchID, const RWUConstStringIterator &start) const

size_t	replace (RWUString &str, const RWUString &replacement, size_t count, int32_t matchID, const RWUConstStringIterator &start, const RWUConstStringIterator &end, bool replaceEmptyMatches=true) const

size_t	replace (RWUString &str, const RWUString &replacement, size_t count=size_t(1), int32_t matchID=0) const

RWURegexResult	search (const RWUString &str) const

RWURegexResult	search (const RWUString &str, const RWUConstStringIterator &start) const

RWURegexResult	search (const RWUString &str, const RWUConstStringIterator &start, const RWUConstStringIterator &end) const

void	setCollationStrength (RWUCollator::CollationStrength)

void	setLevel (UnicodeConformanceLevel level=Basic)

void	setLocale (const RWULocale &loc)

size_t	subCount () const

Detailed Description

RWURegularExpression supports regular expressions with Unicode extensions.

A regular expression is a string pattern composed of normal characters and special characters. Special characters are used to denote an arrangement of the other characters in the regular expression pattern. A regular expression can be used to search for, and perhaps replace, occurrences of the regular expression pattern in strings.

Regular expression syntax describes how to arrange normal characters and special characters to form a valid regular expression pattern. The regular expression syntax for RWURegularExpression is similar to that of the POSIX 2 extended regular expression (ERE) specification. For more information see the Internationalization Module User's Guide.

RWURegularExpression extends the POSIX 2 ERE syntax to provide support for Unicode basic and tailored regular expressions.

Basic Unicode regular expression support corresponds to Level 1 support, as described in the Unicode Regular Expression Guidelines (Unicode Technical Report #18 (UTR-18), Version 5.1). Basic Unicode regular expressions are useful for the majority of Unicode strings, and extend the POSIX ERE standard with the following Unicode extensions:

Hexadecimal notation
Character categories
Subtraction
Simple word boundaries
Simple loose matches
Line breaks

Tailored Unicode regular expressions extend the basic regular expression functionality, corresponding to Level 2 and Level 3 support, also described in UTR-18 Version 5.1. In addition to some minor additions, tailored extensions include support for:

Treating surrogate pairs as single characters
Using the script property
Matching canonically equivalent character representations
Specifying grapheme clusters

For more information on basic and tailored regular expression support in the Internationalization Module, see the Internationalization Module User's Guide.

The Role of the Locale in a Regular Expression

RWURegularExpression accepts an RWULocale argument in its constructor, or via the setLocale() method. The regular expression instance uses the locale to determine locale-specific behavior in a tailored regular expression (locales have little effect on basic regular expressions). Grapheme clusters, character sets, and the break locations for words, sentences, and lines may change depending on the locale. For example, the Spanish character 'ch' is found in the character set "[b-d]" in Spanish locales, but not in English locales.

For more information on creating regular expressions, see the Internationalization Module User's Guide.

Example:

    #include <rw/i18n/RWUConversionContext.h>
    #include <rw/i18n/RWURegularExpression.h>
    #include <rw/i18n/RWUString.h>
    #include <iostream>

    using std::cout;
    using std::endl;

    int main() {
        // Indicate string literals are encoded as US-ASCII strings.
        RWUConversionContext context("US-ASCII");

        // Create a string in which to search.
        RWUString text("The quick brown fox.");

        // Create a regular expression to search for "own" as a
        // distinct word. The character category [{WB}] will be
        // interpreted in terms of the default locale. Use
        // RWURegularExpression::setLocale() to interpret breaks
        // in terms of a different locale.
        RWURegularExpression regexp("[{WB}]own[{WB}]");

        // This search should fail because "own" appears only
        // within the word "brown" and not as a distinct word.
        RWURegexResult result = regexp.search(text);
        if (result) {
            cout << "Overall match at offset " << int32_t(result.begin(text))
                 << " with length " << result.getLength() << "." << endl;
        } else {
            cout << "No match" << endl;
        }

        // Create a regular expression to search for "quick" as
        // a distinct word.
        regexp = RWURegularExpression("[{WB}]quick[{WB}]");

        // This search should succeed.
        result = regexp.search(text);
        if (result) {
            cout << "Overall match at offset " << int32_t(result.begin(text))
                 << " with length " << result.getLength() << "." << endl;
        } else {
            cout << "No match" << endl;
        }

        return 0;
    }

Program output:

    No match
    Overall match at offset 4 with length 5.

See also: RWUStringSearch

Member Enumeration Documentation

◆ Options

enum RWURegularExpression::Options

Lists options for changing the behavior of RWURegularExpression pattern matching.

Enumerator

Normal

Specifies normal pattern matching operations, with no special options enabled.

IgnoreCase

Indicates that characters in the pattern string and search string should be compared without regard to case.

InterpretGraphemes

This option is valid only with Tailored regular expressions. This option causes the pattern compiler to recognize graphemes, such as "a\u0308", as a single unit. This changes, for example, how cardinalities are applied. For example, with this setting, "a\u0308*" matches zero or more occurrences of anything equivalent to "a\u0308", whereas without this option, the pattern would match an 'a', followed by zero or more occurrences of "\u0308".

Further, this option changes the behavior of '.'. With this option, '.' matches any logical character including graphemes (except those outlined above). Without the option, '.' matches any code point except for one which indicates the end of a logical line. (For a list of specific characters excepted, see the Internationalization Module User's Guide.)
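
The cardinality difference described above can be approximated with the standard library. The following is only an analogy built on std::wregex, not on RWURegularExpression: the helper names are hypothetical, and the grapheme-like behavior is imitated with an explicit non-capturing group rather than with the InterpretGraphemes option. It also ignores canonical equivalence, which a Tailored expression would take into account.

```cpp
#include <regex>
#include <string>

// Code-point view: '*' binds only to the combining diaeresis U+0308,
// so the pattern means: one 'a', then zero or more U+0308.
inline bool matchesPerCodePoint(const std::wstring& text) {
    static const std::wregex re(L"a\u0308*");
    return std::regex_match(text, re);
}

// Grapheme-like view: the group makes '*' apply to the whole
// sequence "a\u0308", so the pattern means: zero or more clusters.
inline bool matchesPerGrapheme(const std::wstring& text) {
    static const std::wregex re(L"(?:a\u0308)*");
    return std::regex_match(text, re);
}
```

Under these assumptions, the input "a\u0308a\u0308" is accepted only by the grapheme-like pattern, while "a\u0308\u0308" is accepted only by the code-point pattern.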

◆ Status

enum RWURegularExpression::Status

Lists regular expression pattern error codes that could be reported during regular expression pattern compilation. These error codes are reported through an exception of type RWRegexErr.

Enumerator
Ok	Indicates that the pattern has been successfully compiled.
MissingEscapeSequence	Indicates a missing escape sequence, as in `"ab\"`.
InvalidHexNibble	Indicates an invalid hexadecimal escape sequence, as in `"ab\u00fg"`.
InsufficientHex8Data	Indicates an insufficient number of hex nibbles in an 8-bit hexadecimal escape sequence, as in `"ab\x0"`.
InsufficientHex16Data	Indicates an insufficient number of hex nibbles in a 16-bit hexadecimal escape sequence, as in `"ab\u00f"`.
MissingClosingBracket	Indicates a missing closing bracket on a bracket expression, as in `"ab[cd"`.
MissingClosingCurlyBrace	Indicates a missing closing curly brace in a cardinality specification, as in `"(abc){2,3"`.
MissingClosingParen	Indicates a missing closing parenthesis in a sub-expression definition, as in `"ab(c(d)ef"`.
UnmatchedClosingParen	Indicates that a closing parenthesis was found, for which there is no opening parenthesis, as in `"ab(cd)e)f"`.
InvalidSubexpression	Indicates that an invalid sub-expression specification has been encountered, such as `"ab(*cd)"`.
InvalidDataAfterOr	Indicates that the character following an alternation symbol, `"|"`, was considered invalid, as in `"ab|*cd"`, or `"ab||cd"`.
InvalidDataBeforeOr	Indicates that the data preceding an alternation symbol, `"|"`, was considered invalid, as in `"|"`, `"|bc"`, and `"ab(|cd)"`.
ConsecutiveCardinalities	Indicates that consecutive cardinality specifiers were found in the pattern, as in `"a*+"` or `"ab{2,3}*"`.
InvalidCardinalityRange	Indicates that an invalid cardinality range was specified, as in `"ab{,}"`, and `"a{}"`.
LeadingCardinality	Specifies that a leading cardinality specifier was encountered, as in `"*a"`.
InvalidDecimalDigit	Specifies that an invalid decimal digit was encountered in a pattern string, as in `"ab{3,a}"`.
UnmatchedClosingCurly	Indicates that a closing curly brace was encountered for which there was no matching opening curly brace, as in `"ab2,3}"`.
NeverEndingCategoryName	Indicates that a category name was started, but that no closing curly brace was found to end the category name, as in `"[{L]+123"`.
InvalidCategoryName	Indicates that an unrecognized category name was specified in a bracket expression, as in `"[{Smile}]"`.
InfiniteEmptyMatch	Indicates that a category that could produce a zero-length match was found with infinite cardinality. Such categories include: Word Break `"WB"`, Character Break `"CB"`, Line Break `"LB"`, Sentence Break `"SB"`, Beginning of Line `"BOL"`, and End of Line `"EOL"`. As such, the following are invalid: `"[{WB}]*"`, or `"ab([{WB}])*cd"`.
ASCIIConversionError	Indicates that a problem was encountered while converting a US-ASCII pattern string to UTF-16. This can occur only when using the RWCString conversion constructor.
InvalidGraphemeCluster	Indicates that an invalid grapheme cluster specification was found. This implies that the grapheme cluster did not follow the syntax, `"\g{...}"`, where `"..."` is any sequence of code units. For example, `"\gab}"` is invalid because of a missing opening curly brace.
NumberOfStatusCodes	Indicates the number of status codes potentially reported during the compilation of regular expression patterns.
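
To make two of the codes above concrete, here is a toy checker for parenthesis balance. It is a sketch under stated assumptions, not the library's pattern compiler: the ParenStatus enum and checkParens() function are hypothetical names, only unescaped parentheses are tracked, and bracket expressions are not given special treatment.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for three of the Status codes listed above.
enum class ParenStatus { Ok, MissingClosingParen, UnmatchedClosingParen };

// Tracks the nesting depth of unescaped parentheses in a pattern.
inline ParenStatus checkParens(const std::string& pattern) {
    int depth = 0;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') {        // skip the character that follows a backslash
            ++i;
            continue;
        }
        if (c == '(') {
            ++depth;
        } else if (c == ')' && --depth < 0) {
            return ParenStatus::UnmatchedClosingParen;  // e.g. "ab(cd)e)f"
        }
    }
    // Any parenthesis still open at the end was never closed, e.g. "ab(c(d)ef".
    return depth == 0 ? ParenStatus::Ok : ParenStatus::MissingClosingParen;
}
```

In the real class these conditions are reported through an RWRegexErr exception during construction rather than through a return value.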

◆ UnicodeConformanceLevel

enum RWURegularExpression::UnicodeConformanceLevel

Describes the levels of Unicode Regular Expression support available through RWURegularExpression. Two levels are available: Basic (Level 1), and Tailored (Levels 2 and 3). Both are described in Version 5.1 of Unicode Technical Report #18 and the Internationalization Module User's Guide.

Enumerator
Basic	Specifies Basic Unicode regular expression support.
Tailored	Specifies Tailored Unicode regular expression support, which adds full support for surrogates, and locale-based handling of graphemes and string collation.

Constructor & Destructor Documentation

◆ RWURegularExpression() [1/5]

RWURegularExpression::RWURegularExpression ( )

Default constructor. Creates an empty regular expression pattern object that does not match any input string.

◆ RWURegularExpression() [2/5]

RWURegularExpression::RWURegularExpression ( const RWURegularExpression & source )

Copy constructor. Creates a copy of the source RWURegularExpression object.

Exceptions

std::bad_alloc Thrown if memory resources are exhausted during pattern compilation.

◆ RWURegularExpression() [3/5]

RWURegularExpression::RWURegularExpression	(	const char *	pattern,
		UnicodeConformanceLevel	level = Basic,
		int32_t	options = int32_t(Normal),
		const RWULocale &	locale = RWULocale::getDefault(),
		RWUToUnicodeConverter &	converter = RWUToUnicodeConversionContext::getContext().getConverter() )

explicit

Constructs an RWURegularExpression from the null-terminated char* pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any escape sequences are handled as for RWUString::unescape().

The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Exceptions

std::bad_alloc	Thrown if memory resources are exhausted during pattern compilation.
RWRegexErr	Thrown to report pattern compilation errors.

◆ RWURegularExpression() [4/5]

RWURegularExpression::RWURegularExpression	(	const RWCString &	pattern,
		UnicodeConformanceLevel	level = Basic,
		int32_t	options = int32_t(Normal),
		const RWULocale &	locale = RWULocale::getDefault(),
		RWUToUnicodeConverter &	converter = RWUToUnicodeConversionContext::getContext().getConverter() )

explicit

Constructs an RWURegularExpression from the RWCString pattern. The argument pattern is converted to Unicode using the specified converter. The default encoding for the system is used in the absence of a specified converter. Any escape sequences are handled as for RWUString::unescape().

The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Exceptions

std::bad_alloc	Thrown if memory resources are exhausted during pattern compilation.
RWRegexErr	Thrown to report pattern compilation errors.

◆ RWURegularExpression() [5/5]

RWURegularExpression::RWURegularExpression	(	const RWUString &	pattern,
		UnicodeConformanceLevel	level = Basic,
		int32_t	options = int32_t(Normal),
		const RWULocale &	locale = RWULocale::getDefault() )

explicit

Constructs an RWURegularExpression from the RWUString pattern.

The conformance level indicates the desired level of Unicode Regular Expression conformance. The default is Basic.
The argument options is an int32_t bit-mask of Options, specifying special options for pattern matching. The default value is Normal, indicating no special matching options are used.

Exceptions

std::bad_alloc	Thrown if memory resources are exhausted during pattern compilation.
RWRegexErr	Thrown to report pattern compilation errors.

◆ ~RWURegularExpression()

RWURegularExpression::~RWURegularExpression ( )

Destructor.

Member Function Documentation

◆ getCollationStrength()

RWUCollator::CollationStrength RWURegularExpression::getCollationStrength ( ) const

Returns the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions.

Exceptions

RWUException Thrown if invoked on a basic regular expression.

◆ getLevel()

UnicodeConformanceLevel RWURegularExpression::getLevel ( ) const

Returns the current level of Unicode regular expression support associated with self.

◆ getLocale()

RWULocale RWURegularExpression::getLocale ( ) const

Returns a copy of the locale used by self.

◆ getOptions()

int32_t RWURegularExpression::getOptions ( ) const

Returns the pattern matching Options associated with self as an int32_t bit-mask.
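
A bit-mask of this kind is usually built with bitwise OR and inspected with bitwise AND. The sketch below models that idea in isolation. The enumerator values are assumptions chosen for illustration (this reference does not state the real values), and SketchOptions, combine(), and hasOption() are hypothetical names.

```cpp
#include <cstdint>

// Assumed values for illustration only; the real enumerator values of
// RWURegularExpression::Options are not given in this reference.
enum SketchOptions : std::int32_t {
    Normal = 0,
    IgnoreCase = 1,
    InterpretGraphemes = 2
};

// Builds a mask holding both options.
inline std::int32_t combine(SketchOptions a, SketchOptions b) {
    return static_cast<std::int32_t>(a) | static_cast<std::int32_t>(b);
}

// Tests whether a given option bit is set in the mask.
// With the assumed value 0, Normal can never test as "set".
inline bool hasOption(std::int32_t mask, SketchOptions opt) {
    return (mask & static_cast<std::int32_t>(opt)) != 0;
}
```

With the real class, the same pattern would apply to the value passed as the options constructor argument and to the value returned by getOptions().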

◆ getPattern()

RWUString RWURegularExpression::getPattern ( ) const

Returns the RWUString pattern string currently associated with self.

◆ matchAt() [1/3]

RWURegexResult RWURegularExpression::matchAt ( const RWUString & str ) const

inline

Tests for a match for this regular expression at the first character position in input string str. Does not find matches that begin after this position.

◆ matchAt() [2/3]

RWURegexResult RWURegularExpression::matchAt	(	const RWUString &	str,
		const RWUConstStringIterator &	start ) const

inline

Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches that begin other than at this position.

◆ matchAt() [3/3]

RWURegexResult RWURegularExpression::matchAt	(	const RWUString &	str,
		const RWUConstStringIterator &	start,
		const RWUConstStringIterator &	end ) const

Tests for a match for this regular expression at the specified start character position in input string str. Does not find matches at other than the start position or that end after the end position.
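
The anchored behavior of matchAt() resembles a standard-library search restricted with std::regex_constants::match_continuous. The following sketch uses std::regex on narrow strings as a stand-in. The function matchAtSketch() and its offset-based parameters are hypothetical, and std::regex uses ECMAScript syntax rather than the syntax accepted by RWURegularExpression.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns the length of a match that begins exactly at offset `start`
// and lies entirely within [start, end), or -1 if there is none.
inline long matchAtSketch(const std::string& str, const std::regex& re,
                          std::size_t start, std::size_t end) {
    assert(start <= end && end <= str.size());
    const auto first = str.cbegin() + static_cast<std::ptrdiff_t>(start);
    const auto last = str.cbegin() + static_cast<std::ptrdiff_t>(end);
    std::smatch m;
    // match_continuous rejects matches that begin after `first`.
    if (std::regex_search(first, last, m, re,
                          std::regex_constants::match_continuous)) {
        return static_cast<long>(m.length(0));
    }
    return -1;
}
```

For the text "The quick brown fox." and the pattern "quick", this sketch reports a match only at offset 4, and only when the end offset is at least 9.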

◆ operator<()

bool RWURegularExpression::operator< ( const RWURegularExpression & rhs )

Compares two regular expression objects. The comparison is performed using RWUString::operator<() to compare the pattern strings stored in each regular expression. Returns true if self's pattern is less than the rhs pattern; otherwise, false.

◆ operator=()

RWURegularExpression & RWURegularExpression::operator= ( const RWURegularExpression & rhs )

Assigns the rhs regular expression object to self.

◆ operator==()

bool RWURegularExpression::operator== ( const RWURegularExpression & rhs ) const

Compares two regular expression objects. The comparison is performed using RWUString::operator==() to compare the pattern strings stored in each regular expression. Returns true if self's pattern is equal to the rhs pattern; otherwise, false.

◆ replace() [1/3]

size_t RWURegularExpression::replace	(	RWUString &	str,
		const RWUString &	replacement,
		size_t	count,
		int32_t	matchID,
		const RWUConstStringIterator &	start ) const

inline

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at the specified start position. Returns the number of replacements. Empty (zero-length) matches are replaced.

◆ replace() [2/3]

size_t RWURegularExpression::replace	(	RWUString &	str,
		const RWUString &	replacement,
		size_t	count,
		int32_t	matchID,
		const RWUConstStringIterator &	start,
		const RWUConstStringIterator &	end,
		bool	replaceEmptyMatches = true ) const

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. Specifying a count of 0 replaces all occurrences of the pattern. The search for pattern matches begins at the specified start position. No match that extends beyond the specified end position is replaced. The method also allows you to specify whether or not empty (zero-length) matches should be replaced; the default is true. Returns the number of replacements.

◆ replace() [3/3]

size_t RWURegularExpression::replace	(	RWUString &	str,
		const RWUString &	replacement,
		size_t	count = size_t(1),
		int32_t	matchID = 0 ) const

inline

Replaces substrings in str that match this regular expression with the specified replacement string. Up to count occurrences are replaced. The default count is 1. Specifying a count of 0 replaces all occurrences of the pattern. Returns the number of replacements. Empty (zero-length) matches are replaced.
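
The count convention (a positive count caps the number of replacements, while 0 means "replace all") can be sketched with std::regex. This is an illustration of the convention only: replaceSketch() is a hypothetical helper, it treats the replacement as literal text, and it does not model the matchID parameter or the start and end positions.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Replaces up to `count` matches of `re` in `str` (0 means all matches)
// and returns the number of replacements performed.
inline std::size_t replaceSketch(std::string& str, const std::regex& re,
                                 const std::string& replacement,
                                 std::size_t count = 1) {
    std::string out;
    std::size_t done = 0;
    auto tail = str.cbegin();  // start of the text not yet copied to `out`
    for (std::sregex_iterator it(str.cbegin(), str.cend(), re), last;
         it != last; ++it) {
        if (count != 0 && done == count) {
            break;  // replacement limit reached
        }
        const std::smatch& m = *it;
        out.append(tail, m[0].first);  // copy the text before the match
        out += replacement;
        tail = m[0].second;            // resume after the match
        ++done;
    }
    out.append(tail, str.cend());      // copy whatever follows the last match
    str = out;
    return done;
}
```

Applied to "a1b22c333" with the pattern "[0-9]+" and replacement "#", a count of 0 yields "a#b#c#" and returns 3, whereas the default count of 1 yields "a#b22c333" and returns 1.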

◆ search() [1/3]

RWURegexResult RWURegularExpression::search ( const RWUString & str ) const

inline

Searches input string str for substrings that match this regular expression. The search begins at the beginning of the string, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.

◆ search() [2/3]

RWURegexResult RWURegularExpression::search	(	const RWUString &	str,
		const RWUConstStringIterator &	start ) const

inline

Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the end of the string is reached, or a match is found. Returns an instance of RWURegexResult to report the result of the operation.

◆ search() [3/3]

RWURegexResult RWURegularExpression::search	(	const RWUString &	str,
		const RWUConstStringIterator &	start,
		const RWUConstStringIterator &	end ) const

Searches input string str for substrings that match this regular expression. The search begins at the specified start position, and continues until either the specified end position is reached, or a match is found. No match that extends beyond the specified end position is found. Returns an instance of RWURegexResult to report the result of the operation.
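
The effect of the start and end bounds can be pictured with a bounded std::regex_search. As before, this is a sketch on narrow strings with ECMAScript syntax. The function searchSketch() and its offset-based interface are hypothetical and stand in for the iterator-based interface of the real class.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns the offset in `str` of the first match that lies entirely
// within [start, end), or -1 if no such match exists.
inline long searchSketch(const std::string& str, const std::regex& re,
                         std::size_t start, std::size_t end) {
    assert(start <= end && end <= str.size());
    const auto first = str.cbegin() + static_cast<std::ptrdiff_t>(start);
    const auto last = str.cbegin() + static_cast<std::ptrdiff_t>(end);
    std::smatch m;
    if (std::regex_search(first, last, m, re)) {
        // position(0) is relative to `first`, so add the start offset back.
        return static_cast<long>(start) + static_cast<long>(m.position(0));
    }
    return -1;
}
```

For "The quick brown fox." and the pattern "o[a-z]", a search from offset 0 finds "ow" at offset 12. A search from offset 13 finds "ox" at offset 17, but only if the end offset is at least 19, because a match that would extend past the end is not reported.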

◆ setCollationStrength()

void RWURegularExpression::setCollationStrength ( RWUCollator::CollationStrength )

Sets the collation strength for the collator used in pattern matching with self. This method applies only to Tailored regular expressions.

Exceptions

RWUException Thrown if this method is invoked on a basic regular expression.

◆ setLevel()

void RWURegularExpression::setLevel ( UnicodeConformanceLevel level = Basic )

Sets the Unicode conformance level for self to the specified level. The default is Basic.

Note: The regular expression pattern will be recompiled into a form that more efficiently allows for the specified level of Unicode support.

◆ setLocale()

void RWURegularExpression::setLocale ( const RWULocale & loc )

Imbues the regular expression object with the locale loc. The locale is used internally in the detection of breaks in the text.

◆ subCount()

size_t RWURegularExpression::subCount ( ) const

Returns the count of parenthesized subexpressions contained in the regular expression pattern associated with self. For example, in the pattern a(b(c)d)e, there are two parenthesized subexpressions.
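
The counting rule can be approximated by scanning a pattern for unescaped opening parentheses outside bracket expressions. The sketch below is a simplified model, not the library's implementation: countSubexpressions() is a hypothetical helper, and it assumes that a backslash escapes the next character and that a bracket expression ends at the first unescaped "]".

```cpp
#include <cstddef>
#include <string>

// Counts unescaped '(' characters that appear outside bracket expressions.
inline std::size_t countSubexpressions(const std::string& pattern) {
    std::size_t count = 0;
    bool inBracket = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') {   // skip the character that follows a backslash
            ++i;
            continue;
        }
        if (inBracket) {   // inside [...], parentheses are ordinary characters
            if (c == ']') {
                inBracket = false;
            }
            continue;
        }
        if (c == '[') {
            inBracket = true;
        } else if (c == '(') {
            ++count;
        }
    }
    return count;
}
```

For the pattern a(b(c)d)e from the description above, the sketch returns 2.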

SourcePro® API Reference Guide
