Supports regular expression matching based on the POSIX.2 standard and supports both narrow and wide characters.

#include <rw/tools/regex.h>

Public Types
typedef RWTRegexMatchIterator< T >	iterator

typedef RWTRegexMatchIterator< T >	match_iterator

typedef RWTRegexTraits< T >::Char	RChar

typedef std::basic_string< RChar >	RString

enum	RWTRegexStatus { Ok, MissingEscapeSequence, InvalidHexNibble, InsufficientHex8Data, InsufficientHex16Data, MissingClosingBracket, MissingClosingCurlyBrace, MissingClosingParen, UnmatchedClosingParen, InvalidSubexpression, InvalidDataAfterOr, InvalidDataBeforeOr, ConsecutiveCardinalities, InvalidCardinalityRange, LeadingCardinality, InvalidDecimalDigit, UnmatchedClosingCurly, NumberOfStatusCodes }

Public Member Functions
	RWTRegex ()

	RWTRegex (const RChar *str, size_t length=size_t(-1))

	RWTRegex (const RString &str, size_t length=size_t(-1))

	RWTRegex (const RWTRegex &source)

	RWTRegex (RWTRegex &&rhs)

virtual	~RWTRegex ()

const RWRegexErr &	getStatus () const

size_t	index (const RChar *str, size_t *mLen=0, size_t start=size_t(0), size_t length=size_t(-1))

size_t	index (const RString &str, size_t *mLen=0, size_t start=size_t(0), size_t length=size_t(-1))

RWTRegexResult< T >	matchAt (const RChar *str, size_t start=size_t(0), size_t length=size_t(-1))

RWTRegexResult< T >	matchAt (const RString &str, size_t start=size_t(0), size_t length=size_t(-1))

bool	operator< (const RWTRegex &rhs) const

RWTRegex &	operator= (const RWTRegex &rhs)

RWTRegex &	operator= (RWTRegex &&rhs)

bool	operator== (const RWTRegex &rhs) const

size_t	replace (RString &str, const RString &replacement, size_t count=1, size_t matchID=0, size_t start=size_t(0), size_t length=size_t(-1), bool replaceEmptyMatches=true)

RWTRegexResult< T >	search (const RChar *str, size_t start=size_t(0), size_t length=size_t(-1))

RWTRegexResult< T >	search (const RString &str, size_t start=size_t(0), size_t length=size_t(-1))

size_t	subCount () const

void	swap (RWTRegex< T > &rhs)

Detailed Description

template<class T>
class RWTRegex< T >

RWTRegex is the primary template for the regular expression interface. It provides most of the POSIX.2 standard for regular expression pattern matching and may be used for both narrow (8-bit) and wide (wchar_t) character strings.

RWTRegex can represent both a simple and an extended regular expression such as those found in lex and awk. The constructor "compiles" the expression into a form that can be used more efficiently. The results can then be used for string searches using class RWCString. Regular expressions (REs) can be of arbitrary size, limited by memory. The extended regular expression features found here are a subset of those found in the POSIX.2 standard (ANSI/IEEE Std. 1003.2, ISO/IEC 9945-2).

RWTRegex differs from the POSIX.2 standard in the following ways:

RWTRegex treats all RE special characters as special, unless escaped (prefixed with a \). (The POSIX standard dictates that some RE special characters are escaped when used to form a pattern.)
RWTRegex does not currently support locale-based constructs, such as collating symbols, equivalence classes, or character classes.
RWTRegex does not support backreferences. POSIX.2 defines backreferences only for basic regular expressions (BREs), not for extended regular expressions (EREs).

Constructing a regular expression

To match a single character RE

Any character that is not a special character matches itself.

A backslash (\) followed by any special character matches the literal character itself; that is, its use "escapes" the special character. For example, \* matches "*" without applying the syntax of the * special character.
The "special characters" are:
+ * ? . [ ] ^ $ ( ) { } | \
The period (.) matches any character. For example, ".umpty" matches either "Humpty" or "Dumpty."
A set of characters enclosed in brackets ([ ]) is a one-character RE that matches any of the characters in that set. This means that [akm] matches either an "a", "k", or "m". A range of characters can be indicated with a dash, as in [a-z], which matches any lower-case letter. However, if the first character of the set is the caret (^), then the RE matches any character except those in the set. It does not match the empty string. For example: [^akm] matches any character except "a", "k", or "m". The caret loses its special meaning if it is not the first character of the set.
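
The single-character rules above can be tried out with a small sketch. It uses std::regex with the POSIX extended grammar as a stand-in, because it needs no SourcePro headers; RWTRegex itself may differ in details. The helper name matchesWhole is hypothetical and not part of any library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in helper (an assumption, not a SourcePro function): returns true
// if the whole of `text` matches the POSIX extended pattern `pattern`.
bool matchesWhole(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}
```

With this helper, matchesWhole(".umpty", "Humpty") and matchesWhole("[akm]", "k") are true, matchesWhole("[^akm]", "k") and matchesWhole("[^akm]", "") are false, and matchesWhole("\\*", "*") is true because the escaped asterisk is treated as a literal.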

To match a multicharacter RE

Parentheses (( )) group parts of regular expressions together into subexpressions that can be treated as a single unit. For example, (ha)+ matches one or more "ha"s.
An asterisk (*) following a one-character RE or a parenthesized subexpression matches zero or more occurrences of the RE. Hence, [a-z]* matches zero or more lower-case characters, and (ha)* matches zero or more "ha"s.
A plus (+) following a one-character RE or a parenthesized subexpression matches one or more occurrences of the RE. Hence, [a-z]+ matches one or more lower-case characters, and (ha)+ matches one or more "ha"s.
A question mark (?) marks the preceding RE as optional: it can occur zero times or once in the string, but no more. For example, xy?z matches either xyz or xz.
The concatenation of REs is a RE that matches the corresponding concatenation of strings. For example, [A-Z][a-z]* matches any capitalized word.
The OR character ( | ) allows a choice between two regular expressions. For example, jell(y|ies) matches either "jelly" or "jellies".
Braces ({ }) following a one-character RE match the preceding element the number of times indicated. For example, a{2,3} matches either "aa" or "aaa."
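
The multicharacter operators can be exercised the same way. The sketch below again relies on std::regex with the POSIX extended grammar as a stand-in for RWTRegex, and the helper name fullMatch is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in helper (an assumption, not a SourcePro function): true if the
// entire `text` is matched by the POSIX extended pattern `pattern`.
bool fullMatch(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}
```

For example, fullMatch("(ha)+", "hahaha"), fullMatch("xy?z", "xz"), fullMatch("jell(y|ies)", "jellies") and fullMatch("a{2,3}", "aaa") are all true, while fullMatch("(ha)+", "") and fullMatch("a{2,3}", "aaaa") are false because the whole input must match.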

All or part of the regular expression can be "anchored" to either the beginning or end of the string being searched.

If the caret (^) is at the beginning of the (sub)expression, then the matched string must be at the beginning of the string being searched. For example, "^hat" matches the "hat" at the start of "hatbox" but does not match the "hat" inside "that".
If the dollar sign ($) is at the end of the (sub)expression, then the matched string must be at the end of the string being searched. For example, "know$" would match "I know what I know" but not "He knows what he knows."
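
Anchoring is easiest to see with a search rather than a whole-string match. The following sketch uses std::regex (POSIX extended grammar) as a stand-in for RWTRegex; the helper name foundIn and the pattern "^hat" are illustrative assumptions.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in helper (an assumption, not a SourcePro function): true if
// `pattern` matches anywhere in `text`, subject to any ^ or $ anchors.
bool foundIn(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::extended);
    return std::regex_search(text, re);
}
```

Here foundIn("know$", "I know what I know") is true and foundIn("know$", "He knows what he knows.") is false. Likewise foundIn("^hat", "hat trick") is true, but foundIn("^hat", "that") is false even though the unanchored foundIn("hat", "that") succeeds.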

Overriding the backslash special character

A common pitfall with regular expression classes is overriding the backslash special character (\). The C++ compiler and the regular expression constructor will both assume that any backslashes they see are intended to escape the following character. Thus, to specify a regular expression that exactly matches "a\a", create the regular expression using four backslashes as follows: the regular expression needs to see "a\\a", and for that to happen, the compiler would have to see "a\\\\a".

RWTRegex<char> reg("a\\\\a");
                     ^|^|
                      1 2

The backslashes marked with a ^ are an escape for the compiler, and the ones marked with | will thus be seen by the regular expression parser. At that point, the backslash marked 1 is an escape, and the one marked 2 will actually be put into the regular expression.

Similarly, if you really need to escape a character, such as a '.', you will have to pass two backslashes to the compiler:

RWTRegex<char> regDot("\\.");
                       ^|

Once again, the backslash marked ^ is an escape for the compiler, and the one marked with | will be seen by the regular expression constructor as an escape for the following '.' .
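
The two layers of escaping can be checked by looking at the strings the compiler actually produces. The sketch below is an illustration only: the variable names are hypothetical, the raw string literals require C++11, and the final check uses std::regex (POSIX extended grammar) as a stand-in for the RWTRegex parser.

```cpp
#include <cassert>
#include <regex>
#include <string>

// What the regular expression parser receives once the compiler has
// processed the escapes in each string literal.
const std::string backslashPattern = "a\\\\a"; // four characters: a \ \ a
const std::string dotPattern = "\\.";          // two characters: \ .

// Raw string literals skip the compiler-level escaping, so only the
// regex-level escaping remains visible in the source.
const std::string rawBackslashPattern = R"(a\\a)";
const std::string rawDotPattern = R"(\.)";

// Stand-in check (an assumption, not RWTRegex): whole-string match.
bool matchesWhole(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern, std::regex::extended));
}
```

Both spellings yield the same pattern text, so backslashPattern == rawBackslashPattern, and the pattern matches the three-character input a\a (written "a\\a" in C++ source). Whether raw literals are preferable is a style choice; they mainly help when a pattern contains many backslashes.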

Synopsis:

#include <rw/tools/regex.h>

RWTRegex<char> re0(".*\\.doc");    // Matches filenames with suffix ".doc"
RWCRegularExpression re1("a+");    // Matches one or more 'a'
RWWRegularExpression re2(L"b+");   // Matches one or more wide-character 'b'

See also

Related classes include:

RWTRegexMatchIterator which iterates over matches of a pattern in a given string.
RWTRegexResult which encapsulates the results of a pattern matching operation.
RWTRegexTraits which defines the character traits for a specific type of regular expression character and includes methods for returning these values.
RWRegexErr which reports errors from within RWTRegex.

Persistence: None

Example:

#include <rw/tools/regex.h>
#include <rw/cstring.h>
#include <iostream>

using std::cout;
using std::endl;

int main()
{
    RWCString aString("Hark! Hark! The lark");

    // This regular expression matches any lowercase word
    // or end of a word starting with "l"
    RWTRegex<char> re("l[a-z]*");
    RWTRegexResult<char> result;

    if (result = re.search(aString))
        cout << result.subString(aString) << endl; // Prints "lark"

    return 0;
}

Program output:

lark

Member Typedef Documentation

template<class T>

typedef RWTRegexMatchIterator<T> RWTRegex< T >::iterator

Typedef based on the character type used to instantiate RWTRegex. For example, RWTRegex<char>::iterator is a typedef for RWTRegexMatchIterator<char>.

Note: RWTRegex::match_iterator and RWTRegex::iterator are provided. RWTRegex::iterator is a match iterator. If you need to add new iterator types, you must give them a descriptive prefix, as in RWTRegex::match_iterator.

template<class T>

typedef RWTRegexMatchIterator<T> RWTRegex< T >::match_iterator

Typedef based on the character type used to instantiate RWTRegex.

template<class T>

typedef RWTRegexTraits<T>::Char RWTRegex< T >::RChar

Typedef for the character type.

template<class T>

typedef std::basic_string<RChar> RWTRegex< T >::RString

Typedef for a string type to be used with RWTRegex.

Member Enumeration Documentation

template<class T>

enum RWTRegex::RWTRegexStatus

Defines allowable status codes. These codes are accessed by RWRegexErr.

Enumerator
Ok
MissingEscapeSequence
InvalidHexNibble
InsufficientHex8Data
InsufficientHex16Data
MissingClosingBracket
MissingClosingCurlyBrace
MissingClosingParen
UnmatchedClosingParen
InvalidSubexpression
InvalidDataAfterOr
InvalidDataBeforeOr
ConsecutiveCardinalities
InvalidCardinalityRange
LeadingCardinality
InvalidDecimalDigit
UnmatchedClosingCurly
NumberOfStatusCodes

Constructor & Destructor Documentation

template<class T>

RWTRegex< T >::RWTRegex ( )

Default constructor. Objects initialized with this constructor represent uninitialized patterns. These objects should be assigned a valid pattern before use.

template<class T>

RWTRegex< T >::RWTRegex	(	const RChar *	str,
		size_t	length = `size_t(-1)`
	)

Initializes an RWTRegex object to represent the pattern specified in the str parameter.

The parameter str specifies the pattern string for the regular expression.

The parameter length specifies the length, in characters, of the pattern string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a NULL character, according to its character traits. (The traits for each type of character are defined in RWTRegexTraits.)

Exceptions

RWRegexErr Thrown if a pattern error is encountered.

template<class T>

RWTRegex< T >::RWTRegex	(	const RString &	str,
		size_t	length = `size_t(-1)`
	)

Initializes an RWTRegex object to represent the pattern specified in str.

Parameters

str	The pattern string for the RE.
length	The length, in characters, of the pattern string. If length is not specified, the length of str is used.

Exceptions

RWRegexErr Thrown if a pattern error is encountered.

template<class T>

RWTRegex< T >::RWTRegex ( const RWTRegex< T > & source )

Copy constructor. The pattern represented by the source RWTRegex object is copied to this RWTRegex object. This copying operation is performed without recompiling the original pattern.

template<class T>

RWTRegex< T >::RWTRegex ( RWTRegex< T > && rhs )

Move constructor. The constructed instance takes ownership of the data owned by rhs.

Condition: This method is available only on platforms with rvalue reference support.

template<class T>

virtual RWTRegex< T >::~RWTRegex ( )

Destructor. Releases any allocated memory.

Member Function Documentation

template<class T>

const RWRegexErr& RWTRegex< T >::getStatus ( ) const

Returns the status of the most recent pattern compilation. This method is useful primarily in exception-disabled environments in which the default error handler for the Essential Tools Module error framework has been replaced with a function that does not abort. Otherwise, a failed compilation throws or aborts, and the regular expression object is not available for this query.

template<class T>

size_t RWTRegex< T >::index	(	const RChar *	str,
		size_t *	mLen = `0`,
		size_t	start = `size_t(0)`,
		size_t	length = `size_t(-1)`
	)

Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string. It continues, one character at a time, until either a match is found, or the end of the string is reached. Use length to specify the length of the input string.

If a match is found, returns the index into the string at which the first match was found, starting from the beginning of the string. The length of the match is returned in the mLen argument.
If no match is found, returns RW_NPOS.

Parameters

str	The string to be searched for a match.
mLen	A return parameter representing the length of any match found during this operation. If not supplied (`NULL`), the length is not returned, but is available through RWTRegexResult<T>::getLength().
start	The character position where the search for a match will start.
length	The length, in characters, of the entire input string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a `NULL` character, as defined by the traits specific to this type of character.

Returns the starting character position, from the beginning of the string, of a match. If no match is found, RW_NPOS is returned.
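
The return contract described above (an offset from the beginning of the string, an optional length out-parameter, and a sentinel when nothing matches) can be sketched as follows. This is a stand-in written against std::regex, not RWTRegex code: the function name indexOf is hypothetical, and std::string::npos plays the role of RW_NPOS.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for the index() contract, using std::regex.
// Returns the offset of the first match at or after `start`, or
// std::string::npos when there is none. If `mLen` is non-null it
// receives the length of the match.
std::size_t indexOf(const std::regex& re, const std::string& str,
                    std::size_t* mLen = nullptr, std::size_t start = 0)
{
    if (start > str.size())
        return std::string::npos;

    std::smatch m;
    if (!std::regex_search(str.cbegin() + start, str.cend(), m, re))
        return std::string::npos;

    if (mLen)
        *mLen = static_cast<std::size_t>(m.length(0));

    // m.position(0) is relative to where the search began, so add
    // `start` to report an offset from the beginning of the string.
    return start + static_cast<std::size_t>(m.position(0));
}
```

With the pattern l[a-z]* and the input "Hark! Hark! The lark", this sketch reports offset 16 and length 4, which corresponds to "lark".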

template<class T>

size_t RWTRegex< T >::index	(	const RString &	str,
		size_t *	mLen = `0`,
		size_t	start = `size_t(0)`,
		size_t	length = `size_t(-1)`
	)

Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or the end of the string is reached. Use length to specify the length of the input string.

If a match is found, the method returns the index into the string at which the first match was found, starting from the beginning of the string. The length of the match is returned in the mLen argument.
If no match is found, the method returns RW_NPOS.

Parameters

str	The string to be searched for a match.
mLen	A return parameter representing the length of any match found during this operation. If not supplied (`NULL`), the length is not returned, but is available through RWTRegexResult<T>::getLength().
start	The character position where the search for a match will start.
length	The length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object.

template<class T>

RWTRegexResult<T> RWTRegex< T >::matchAt	(	const RChar *	str,
		size_t	start = `size_t(0)`,
		size_t	length = `size_t(-1)`
	)

Searches an input string for a match against the pattern string represented by this RWTRegex object. The match must start at the specified character in the input string. (This is similar to anchoring the pattern at the beginning of the string using the circumflex character ^.)

If a match is found, the returned RWTRegexResult evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() represents the longest match starting at the start position.
If no match is found, the returned RWTRegexResult evaluates to false.

Parameters

str	The string to be searched for a match.
start	The character position where the search for a match will start.
length	The length, in characters, of the entire input string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a `NULL` character, as defined by the traits specific to this type of character.

template<class T>

RWTRegexResult<T> RWTRegex< T >::matchAt	(	const RString &	str,
		size_t	start = `size_t(0)`,
		size_t	length = `size_t(-1)`
	)

Searches an input string for a match against the pattern string represented by this RWTRegex object. The match must start at the specified character in the input string. (This is similar to anchoring the pattern at the beginning of the string using the circumflex character ^.)

If a match is found, the returned RWTRegexResult evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() represents the longest match starting at the start position.
If no match is found, the returned RWTRegexResult evaluates to false.

Parameters

str	The string to be searched for a match.
start	The character position where the search for a match will start.
length	The length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object.
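
The difference between a match that must begin at a given position and a search that may begin anywhere after it can be sketched with std::regex, used here as a stand-in for RWTRegex. The function names matchesAt and foundFrom are hypothetical; the first relies on the standard std::regex_constants::match_continuous flag.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Stand-in for "the match must start at `start`" (not RWTRegex code).
bool matchesAt(const std::regex& re, const std::string& str,
               std::size_t start = 0)
{
    if (start > str.size())
        return false;
    // match_continuous only accepts matches that begin at the first
    // iterator passed in, which mimics anchoring at `start`.
    return std::regex_search(str.cbegin() + start, str.cend(), re,
                             std::regex_constants::match_continuous);
}

// Stand-in for a search that may find a match at or after `start`.
bool foundFrom(const std::regex& re, const std::string& str,
               std::size_t start = 0)
{
    if (start > str.size())
        return false;
    return std::regex_search(str.cbegin() + start, str.cend(), re);
}
```

For the pattern l[a-z]* and the input "Hark! Hark! The lark", foundFrom succeeds from position 0, whereas matchesAt fails at position 0 and succeeds only at position 16, where "lark" begins.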

template<class T>

bool RWTRegex< T >::operator< ( const RWTRegex< T > & rhs ) const

Compares this RWTRegex object to the rhs RWTRegex object by performing an element-by-element comparison of the characters in each object's pattern string. Character comparisons are performed as defined by the lt method on the RWTRegexTraits class implemented for the type of character in use.

This object is considered less than rhs if it contains the lesser of the first two unequal characters, from left to right, or if there are no unequal characters, but this pattern string is shorter than rhs, i.e. this pattern string has fewer characters.

Returns true if this RWTRegex is less than rhs.
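
The ordering rule described above is a lexicographical comparison of the two pattern strings. The sketch below illustrates it with std::char_traits<char>::lt standing in for the lt method of RWTRegexTraits, which is an assumption made to keep the example self-contained; patternLess is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Sketch of the ordering rule applied to pattern strings: the first
// unequal character decides, and if one string is a prefix of the
// other, the shorter one is the lesser.
bool patternLess(const std::string& lhs, const std::string& rhs)
{
    return std::lexicographical_compare(
        lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
        [](char a, char b) { return std::char_traits<char>::lt(a, b); });
}
```

Under this rule "ab" is less than "abc" (a prefix is shorter), "abc" is less than "abd" (first unequal character), and "b" is not less than "abc". Note that the comparison looks only at the pattern text, not at the set of strings each pattern matches.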

template<class T>

RWTRegex& RWTRegex< T >::operator= ( const RWTRegex< T > & rhs )

Assignment operator. Copies the RWTRegex object specified by rhs into this RWTRegex object. The copy is performed without recompiling the original pattern. Returns a reference to this newly assigned RWTRegex object.

template<class T>

RWTRegex& RWTRegex< T >::operator= ( RWTRegex< T > && rhs )

Move assignment. Self takes ownership of the data owned by rhs.

Condition: This method is available only on platforms with rvalue reference support.

template<class T>

bool RWTRegex< T >::operator== ( const RWTRegex< T > & rhs ) const

Compares this RWTRegex object to the rhs RWTRegex object by performing an element-by-element comparison of the characters in each object's pattern string. Character comparisons are performed as defined by the eq method on the RWTRegexTraits class implemented for the type of character in use.

This object is considered equal to rhs if both patterns contain the same number of characters and each corresponding pair of characters is equal.

Returns true if this RWTRegex is equal to rhs.

template<class T>

size_t RWTRegex< T >::replace	(	RString &	str,
		const RString &	replacement,
		size_t	count = `1`,
		size_t	matchID = `0`,
		size_t	start = `size_t(0)`,
		size_t	length = `size_t(-1)`,
		bool	replaceEmptyMatches = `true`
	)

Replaces occurrences of the regular expression pattern in str with a replacement string, replacement. The number of replacements is identified by count. The default value for count is 1, which replaces only the first occurrence of the pattern.

Zero-length matches are replaced only if replaceEmptyMatches is true. The search begins at the start character position. The length, in characters, of the original string is identified by length. The input str is updated as part of this operation.

Returns the total number of occurrences replaced.

Parameters

str	The string to be searched for a match.
replacement	The string to replace all occurrences of the pattern in str.
count	The number of matches to replace. If `0` is specified, all matches are replaced.
matchID	The match identifier of the sub-expression to be replaced. The default value of `0` replaces the overall match with specified replacement text.
start	The character position where the search for a match will start.
length	The length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object.
replaceEmptyMatches	Boolean. If `true`, zero-length matches are replaced, as well as all other matches. Otherwise, only matches with length greater than zero are replaced.

Returns the number of matches that were replaced. If no match is found, 0 is returned and str is left unchanged.
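
The count semantics (a fixed number of replacements, or all of them when count is 0) can be sketched as below. This is a simplified stand-in built on std::regex, not the RWTRegex implementation: replaceMatches is a hypothetical helper, it always replaces the overall match (there is no matchID), and it stops at the first empty match instead of honouring replaceEmptyMatches.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper, not part of SourcePro: replaces up to `count`
// non-overlapping matches of `re` in `str` (count == 0 means "all")
// and returns how many replacements were made.
std::size_t replaceMatches(std::string& str, const std::regex& re,
                           const std::string& replacement,
                           std::size_t count = 1)
{
    std::string out;
    std::size_t replaced = 0;
    auto pos = str.cbegin();
    std::smatch m;

    while ((count == 0 || replaced < count) &&
           std::regex_search(pos, str.cend(), m, re)) {
        if (m.length(0) == 0)
            break;                    // simplification: stop on empty matches
        out.append(pos, m[0].first);  // text before the match
        out += replacement;
        pos = m[0].second;            // continue after the match
        ++replaced;
    }

    out.append(pos, str.cend());      // remaining unmatched tail
    str = out;
    return replaced;
}
```

Applied to "one fish two fish" with the pattern fish and the replacement "cat", the default count of 1 yields "one cat two fish" and returns 1, while a count of 0 yields "one cat two cat" and returns 2.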

template<class T>

RWTRegexResult<T> RWTRegex< T >::search	(	const RChar *	str,
		size_t	start = `size_t(0)`,
		size_t	length = `size_t(-1)`
	)

Searches an input string for the first occurrence of a match for this RE pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or until the end of the string is reached.

If a match is found, the returned RWTRegexResult evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() represents the longest match starting from the first position at which a match is found.
If no match is found, the returned RWTRegexResult evaluates to false.

Parameters

str	The string to be searched for a match.
start	The character position where the search for a match will start.
length	The length, in characters, of the entire input string. If the length is not specified, it is calculated as the number of characters preceding the first occurrence of a `NULL` character, as defined by this character's traits.

template<class T>

RWTRegexResult<T> RWTRegex< T >::search	(	const RString &	str,
		size_t	start = `size_t(0)`,
		size_t	length = `size_t(-1)`
	)

Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or until the end of the string is reached.

If a match is found, the returned RWTRegexResult evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() represents the longest match starting from the first position at which a match is found.
If no match is found, the returned RWTRegexResult evaluates to false.

Parameters

str	The string to be searched for a match.
start	The character position where the search for a match will start.
length	The length, in characters, of the entire input string. If the length is not specified, then it is assigned the length of the input string object.

template<class T>

size_t RWTRegex< T >::subCount ( ) const

Returns the number of parenthesized subexpressions in this regular expression.

template<class T>

void RWTRegex< T >::swap ( RWTRegex< T > & rhs )

Swaps the data owned by self with the data owned by rhs.

SourcePro® API Reference Guide
