Supports regular expression matching based on the POSIX.2 standard and supports both narrow and wide characters.
#include <rw/tools/regex.h>
Public Types | |
typedef RWTRegexMatchIterator< T > | iterator |
typedef RWTRegexMatchIterator< T > | match_iterator |
typedef RWTRegexTraits< T >::Char | RChar |
typedef std::basic_string< RChar > | RString |
enum | RWTRegexStatus { Ok, MissingEscapeSequence, InvalidHexNibble, InsufficientHex8Data, InsufficientHex16Data, MissingClosingBracket, MissingClosingCurlyBrace, MissingClosingParen, UnmatchedClosingParen, InvalidSubexpression, InvalidDataAfterOr, InvalidDataBeforeOr, ConsecutiveCardinalities, InvalidCardinalityRange, LeadingCardinality, InvalidDecimalDigit, UnmatchedClosingCurly, NumberOfStatusCodes } |
Public Member Functions | |
RWTRegex () | |
RWTRegex (const RChar *str, size_t length=size_t(-1)) | |
RWTRegex (const RString &str, size_t length=size_t(-1)) | |
RWTRegex (const RWTRegex &source) | |
RWTRegex (RWTRegex &&rhs) | |
virtual | ~RWTRegex () |
const RWRegexErr & | getStatus () const |
size_t | index (const RChar *str, size_t *mLen=0, size_t start=size_t(0), size_t length=size_t(-1)) |
size_t | index (const RString &str, size_t *mLen=0, size_t start=size_t(0), size_t length=size_t(-1)) |
RWTRegexResult< T > | matchAt (const RChar *str, size_t start=size_t(0), size_t length=size_t(-1)) |
RWTRegexResult< T > | matchAt (const RString &str, size_t start=size_t(0), size_t length=size_t(-1)) |
bool | operator< (const RWTRegex &rhs) const |
RWTRegex & | operator= (const RWTRegex &rhs) |
RWTRegex & | operator= (RWTRegex &&rhs) |
bool | operator== (const RWTRegex &rhs) const |
size_t | replace (RString &str, const RString &replacement, size_t count=1, size_t matchID=0, size_t start=size_t(0), size_t length=size_t(-1), bool replaceEmptyMatches=true) |
RWTRegexResult< T > | search (const RChar *str, size_t start=size_t(0), size_t length=size_t(-1)) |
RWTRegexResult< T > | search (const RString &str, size_t start=size_t(0), size_t length=size_t(-1)) |
size_t | subCount () const |
void | swap (RWTRegex< T > &rhs) |
RWTRegex is the primary template for the regular expression interface. It provides most of the POSIX.2 standard for regular expression pattern matching and may be used for both narrow (8-bit) and wide (wchar_t) character strings.
RWTRegex can represent both a simple and an extended regular expression such as those found in lex and awk. The constructor "compiles" the expression into a form that can be used more efficiently. The results can then be used for string searches using class RWCString. Regular expressions (REs) can be of arbitrary size, limited by memory. The extended regular expression features found here are a subset of those found in the POSIX.2 standard (ANSI/IEEE Std. 1003.2, ISO/IEC 9945-2).
RWTRegex differs from the POSIX.2 standard in the following ways:
\
). (The POSIX standard dictates that some RE special characters are escaped when used to form a pattern.)
Constructing a regular expression
To match a single character RE
Any character that is not a special character matches itself.
A backslash (\) followed by any special character matches the literal character itself; that is, its use "escapes" the special character. For example, \* matches "*" without applying the syntax of the * special character.
A set of characters enclosed in brackets ([ ]) is a one-character RE that matches any of the characters in that set. This means that [akm] matches either an "a", "k", or "m". A range of characters can be indicated with a dash, as in [a-z], which matches any lower-case letter. However, if the first character of the set is the caret (^), then the RE matches any character except those in the set. It does not match the empty string. For example, [^akm] matches any character except "a", "k", or "m". The caret loses its special meaning if it is not the first character of the set.
To match a multicharacter RE
Parentheses (( )) group parts of regular expressions together into subexpressions that can be treated as a single unit. For example, (ha)+ matches one or more "ha"s.
An asterisk (*) following a one-character RE or a parenthesized subexpression matches zero or more occurrences of the RE. Hence, [a-z]* matches zero or more lower-case characters, and (ha)* matches zero or more "ha"s.
A plus sign (+) following a one-character RE or a parenthesized subexpression matches one or more occurrences of the RE. Hence, [a-z]+ matches one or more lower-case characters, and (ha)+ matches one or more "ha"s.
A question mark (?) marks the preceding RE as optional: it can occur zero times or once in the string, no more. For example, xy?z matches either xyz or xz.
REs can be concatenated. For example, [A-Z][a-z]* matches any capitalized word.
A vertical bar (|) allows a choice between two regular expressions. For example, jell(y|ies) matches either "jelly" or "jellies".
Braces ({ }) following a one-character RE match the preceding element according to the number indicated. For example, a{2,3} matches either "aa" or "aaa".
All or part of the regular expression can be "anchored" to either the beginning or end of the string being searched.
If a caret (^) is at the beginning of the (sub)expression, then the matched string must be at the beginning of the string being searched. For example, "^hat" matches the "hat" that begins "hat trick" but not the "hat" inside "that".
If a dollar sign ($) is at the end of the (sub)expression, then the matched string must be at the end of the string being searched. For example, "know$" would match "I know what I know" but not "He knows what he knows."
Overriding the backslash special character
A common pitfall with regular expression classes is overriding the backslash special character (\). The C++ compiler and the regular expression constructor will both assume that any backslashes they see are intended to escape the following character. Thus, to specify a regular expression that exactly matches "a\a", create the regular expression using four backslashes: the regular expression needs to see "a\\a", and for that to happen, the compiler would have to see "a\\\\a".
Of the four backslashes in "a\\\\a", the first and third are escapes for the compiler, so the second and fourth are the ones seen by the regular expression parser. At that point, the first of those two is an escape, and the second is actually put into the regular expression.
Similarly, if you really need to escape a character, such as a '.', you will have to pass two backslashes to the compiler, as in "\\.". Once again, the first backslash is an escape for the compiler, and the second is seen by the regular expression constructor as an escape for the following '.'.
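The two levels of escaping described above can be checked with a short sketch. It uses std::regex with the POSIX extended grammar as a stand-in, because RWTRegex requires the SourcePro library; the function names are illustrative, and the sketch assumes that the stand-in treats a backslash before a special character the way the text describes.

```cpp
#include <regex>
#include <string>

// Pattern text that reaches the regex parser: a \ \ a (four characters).
// The compiler consumes one backslash out of each pair in the literal.
inline std::string literalBackslashPattern() {
    return "a\\\\a";
}

// Pattern text that reaches the regex parser: a \ . b (four characters).
// The surviving backslash escapes the period.
inline std::string literalPeriodPattern() {
    return "a\\.b";
}

// Returns true if the whole of `text` matches `pattern`.
// std::regex::extended selects the POSIX extended grammar (stand-in for RWTRegex).
inline bool matchesWhole(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}
```

Here matchesWhole(literalBackslashPattern(), "a\\a") tests the three-character input a\a, since the input literal also loses one backslash to the compiler.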
Related classes include RWTRegexResult, RWTRegexMatchIterator, RWTRegexTraits, and RWRegexErr.
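The construction rules described earlier (sets, grouping, cardinality, alternation, and anchors) can be tried out with a minimal sketch. It is not the library's own example: std::regex with the POSIX extended grammar stands in for RWTRegex, and the helper names matchesEntire and occursIn are hypothetical.

```cpp
#include <regex>
#include <string>

// Returns true if the whole of `text` matches `pattern`.
inline bool matchesEntire(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}

// Returns true if `pattern` matches anywhere inside `text`.
inline bool occursIn(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_search(text, re);
}
```

For instance, matchesEntire("jell(y|ies)", "jellies") and occursIn("know$", "I know what I know") both hold, while occursIn("^hat", "that") does not, because the caret anchors the match to the start of the string.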
typedef RWTRegexMatchIterator<T> RWTRegex< T >::iterator |
Typedef based on the character type used to instantiate RWTRegex. For example, RWTRegex<char>::iterator is a typedef for RWTRegexMatchIterator<char>.
typedef RWTRegexMatchIterator<T> RWTRegex< T >::match_iterator |
Typedef based on the character type used to instantiate RWTRegex.
typedef RWTRegexTraits<T>::Char RWTRegex< T >::RChar |
Typedef for the character type.
typedef std::basic_string<RChar> RWTRegex< T >::RString |
Typedef for a string type to be used with RWTRegex.
enum RWTRegex::RWTRegexStatus |
Defines allowable status codes. These codes are accessed by RWRegexErr.
Default constructor. Objects initialized with this constructor represent uninitialized patterns. These objects should be assigned a valid pattern before use.
Initializes an RWTRegex object to represent the pattern specified in the str parameter.
The parameter str specifies the pattern string for the regular expression.
The parameter length specifies the length, in characters, of the pattern string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a NULL character, according to its character traits. (The traits for each type of character are defined in RWTRegexTraits.)
RWRegexErr | Thrown if a pattern error is encountered. |
Initializes an RWTRegex object to represent the pattern specified in str.
str | The pattern string for the RE. |
length | The length, in characters, of the pattern string. If length is not specified, the length of str is used. |
RWRegexErr | Thrown if a pattern error is encountered. |
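How a pattern error surfaces at construction time can be sketched as follows. This is an analogy only: std::regex and std::regex_error stand in for RWTRegex and RWRegexErr, and the helper name compiles is hypothetical. Unbalanced patterns such as "(ab" or "[ab" correspond to status codes like MissingClosingParen and MissingClosingBracket in the enum listed earlier.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` compiles under the POSIX extended grammar.
// A malformed pattern makes the std::regex constructor throw std::regex_error,
// which plays the role of RWRegexErr in this sketch.
inline bool compiles(const std::string& pattern) {
    try {
        (void)std::regex(pattern, std::regex::extended);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```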
Move constructor. The constructed instance takes ownership of the data owned by rhs.
Destructor. Releases any allocated memory.
const RWRegexErr& RWTRegex< T >::getStatus () const |
Returns the status of the most recent pattern compilation. This method is useful primarily in exception-disabled environments in which the default error handler for the Essential Tools Module error framework has been replaced with a function that does not abort. Otherwise, the regular expression object will not be available for this query.
size_t RWTRegex< T >::index (const RChar * str, size_t * mLen = 0, size_t start = size_t(0), size_t length = size_t(-1)) |
Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string. It continues, one character at a time, until either a match is found, or the end of the string is reached. Use length to specify the length of the input string.
str | The string to be searched for a match. |
mLen | A return parameter representing the length of any match found during this operation. If not supplied (NULL), the length is not returned, but is available through RWTRegexResult<T>::getLength(). |
start | The character position where the search for a match will start. |
length | The length, in characters, of the entire input string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a NULL character, as defined by the traits specific to this type of character. |
Returns the starting character position, from the beginning of the string, of a match. If no match is found, RW_NPOS is returned.
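The contract described above (start position returned, match length written through an optional pointer, a sentinel when nothing matches) can be sketched with std::regex as a stand-in. The names indexOf and kNoMatch are hypothetical; kNoMatch plays the role of RW_NPOS, and the pattern is passed as a string instead of being held by a regex object.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Stand-in for RW_NPOS: the value returned when no match is found.
const std::size_t kNoMatch = static_cast<std::size_t>(-1);

// Returns the start of the first match at or after `start`, measured from the
// beginning of `str`. Stores the match length in *mLen when mLen is non-null.
inline std::size_t indexOf(const std::string& pattern, const std::string& str,
                           std::size_t* mLen = nullptr, std::size_t start = 0) {
    if (start > str.size()) return kNoMatch;
    const std::regex re(pattern, std::regex::extended);
    std::smatch m;
    if (!std::regex_search(str.begin() + start, str.end(), m, re)) return kNoMatch;
    if (mLen) *mLen = static_cast<std::size_t>(m.length(0));
    // position(0) is relative to the search start, so add the offset back.
    return start + static_cast<std::size_t>(m.position(0));
}
```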
size_t RWTRegex< T >::index (const RString & str, size_t * mLen = 0, size_t start = size_t(0), size_t length = size_t(-1)) |
Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or the end of the string is reached. Use length to specify the length of the input string.
str | The string to be searched for a match. |
mLen | A return parameter representing the length of any match found during this operation. If not supplied (NULL), the length is not returned, but is available through RWTRegexResult<T>::getLength(). |
start | The character position where the search for a match will start. |
length | The length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object. |
RWTRegexResult<T> RWTRegex< T >::matchAt (const RChar * str, size_t start = size_t(0), size_t length = size_t(-1)) |
Searches an input string for a match against the pattern string represented by this RWTRegex object. The match must start at the specified character in the input string. (This is similar to anchoring the pattern at the beginning of the string using the circumflex character ^.)
If a match is found, the returned result evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() represents the longest match starting from the first character in the string.
If no match is found, the returned result evaluates to false.
str | The string to be searched for a match. |
start | The character position where the search for a match will start. |
length | The length, in characters, of the entire input string. If the length is not specified, then it is calculated as the number of characters preceding the first occurrence of a NULL character, as defined by the traits specific to this type of character. |
RWTRegexResult<T> RWTRegex< T >::matchAt (const RString & str, size_t start = size_t(0), size_t length = size_t(-1)) |
Searches an input string for a match against the pattern string represented by this RWTRegex object. The match must start at the specified character in the input string. (This is similar to anchoring the pattern at the beginning of the string using the circumflex character ^.)
If a match is found, the returned result evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() represents the longest match starting from the first character in the string.
If no match is found, the returned result evaluates to false.
str | The string to be searched for a match. |
start | The character position where the search for a match will start. |
length | The length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object. |
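The difference between matching at a fixed position and searching forward can be illustrated with a small sketch. It assumes std::regex as a stand-in: the match_continuous flag requires the match to begin exactly at the first character of the searched range, which mirrors the behavior described for matchAt. The names matchLengthAt and kNotMatched are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Sentinel returned when the pattern does not match at the requested position.
const std::size_t kNotMatched = static_cast<std::size_t>(-1);

// Returns the length of the match that begins exactly at `start`,
// or kNotMatched if the pattern does not match there.
inline std::size_t matchLengthAt(const std::string& pattern, const std::string& str,
                                 std::size_t start = 0) {
    if (start > str.size()) return kNotMatched;
    const std::regex re(pattern, std::regex::extended);
    std::smatch m;
    // match_continuous: do not slide forward looking for a later match.
    if (!std::regex_search(str.begin() + start, str.end(), m, re,
                           std::regex_constants::match_continuous)) {
        return kNotMatched;
    }
    return static_cast<std::size_t>(m.length(0));
}
```

With the input "abc123" and the pattern "[0-9]+", a call with start 0 reports no match even though digits occur later in the string, while a call with start 3 reports a match of length 3.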
Compares this RWTRegex object to the rhs RWTRegex object by performing an element-by-element comparison of the characters in each object's pattern string. Character comparisons are performed as defined by the lt method on the RWTRegexTraits class implemented for the type of character in use.
This object is considered less than rhs if it contains the lesser of the first two unequal characters, from left to right, or if there are no unequal characters, but this pattern string is shorter than rhs, i.e. this pattern string has fewer characters.
Returns true if this RWTRegex is less than rhs.
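The ordering rule above is a lexicographic comparison of the two pattern strings. A minimal sketch, assuming std::char_traits<char>::lt as a stand-in for the lt method of RWTRegexTraits and using the hypothetical name patternLess:

```cpp
#include <algorithm>
#include <string>

// Returns true if pattern string `lhs` orders before pattern string `rhs`:
// the first unequal character decides, and a proper prefix orders first.
inline bool patternLess(const std::string& lhs, const std::string& rhs) {
    return std::lexicographical_compare(
        lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
        [](char a, char b) { return std::char_traits<char>::lt(a, b); });
}
```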
Move assignment. Self takes ownership of the data owned by rhs.
Compares this RWTRegex object to the rhs RWTRegex object by performing an element-by-element comparison of the characters in each object's pattern string. Character comparisons are performed as defined by the eq method on the RWTRegexTraits class implemented for the type of character in use.
This object is considered equal to rhs if both pattern strings contain the same number of characters, and each corresponding pair of characters in the patterns is equal.
Returns true if this RWTRegex is equal to rhs.
size_t RWTRegex< T >::replace (RString & str, const RString & replacement, size_t count = 1, size_t matchID = 0, size_t start = size_t(0), size_t length = size_t(-1), bool replaceEmptyMatches = true) |
Replaces occurrences of the regular expression pattern in str with a replacement string, replacement. The number of replacements is identified by count. The default value for count is 1, which replaces only the first occurrence of the pattern.
Zero-length matches are replaced only if replaceEmptyMatches is true. The search begins at the start character position. The length, in characters, of the original string is identified by length. The input str is updated as part of this operation.
Returns the total number of occurrences replaced.
str | The string to be searched for a match. |
replacement | The string that replaces each matched occurrence of the pattern in str. |
count | The number of matches to replace. If 0 is specified, all matches are replaced. |
matchID | The match identifier of the sub-expression to be replaced. The default value of 0 replaces the overall match with specified replacement text. |
start | The character position where the search for a match will start. |
length | The length, in characters, of the entire input string. If the length is not specified, it is assigned the length of the input string object. |
replaceEmptyMatches | Boolean. If true , zero-length matches are replaced, as well as all other matches. Otherwise, only matches with length greater than zero are replaced. |
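The count semantics (1 replaces only the first match, 0 replaces all matches, and the return value is the number of replacements made) can be sketched with std::regex as a stand-in. The name replaceMatches is hypothetical. The sketch treats the replacement text literally and covers only the overall match, so the matchID, start, length, and replaceEmptyMatches parameters are left out.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Replaces up to `count` matches of `pattern` in `str` (0 means all matches)
// with `replacement`, updates `str` in place, and returns the number replaced.
inline std::size_t replaceMatches(std::string& str, const std::string& pattern,
                                  const std::string& replacement,
                                  std::size_t count = 1) {
    const std::regex re(pattern, std::regex::extended);
    std::string out;
    std::size_t done = 0;
    auto tail = str.cbegin();  // first character not yet copied to `out`
    for (std::sregex_iterator it(str.cbegin(), str.cend(), re), end; it != end; ++it) {
        if (count != 0 && done == count) break;
        const std::smatch& m = *it;
        out.append(tail, m[0].first);  // text between the previous match and this one
        out += replacement;
        tail = m[0].second;
        ++done;
    }
    out.append(tail, str.cend());      // remainder after the last replaced match
    str = out;
    return done;
}
```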
RWTRegexResult<T> RWTRegex< T >::search (const RChar * str, size_t start = size_t(0), size_t length = size_t(-1)) |
Searches an input string for the first occurrence of a match for this RE pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or until the end of the string is reached.
If a match is found, the returned result evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() will represent the longest match starting from the first position at which a match is found.
If no match is found, the returned result evaluates to false.
str | The string to be searched for a match. |
start | The character position where the search for a match will start. |
length | The length, in characters, of the entire input string. If the length is not specified, it is calculated as the number of characters preceding the first occurrence of a NULL character, as defined by this character's traits. |
RWTRegexResult<T> RWTRegex< T >::search (const RString & str, size_t start = size_t(0), size_t length = size_t(-1)) |
Searches an input string for the first occurrence of a match for this regular expression pattern. The search begins with the character at a specified start character position in the supplied input string, and continues, one character at a time, until either a match is found, or until the end of the string is reached.
If a match is found, the returned result evaluates to true, and the match information returned through RWTRegexResult<T>::getStart() and RWTRegexResult<T>::getLength() will represent the longest match starting from the first position at which a match is found.
If no match is found, the returned result evaluates to false.
str | The string to be searched for a match. |
start | The character position where the search for a match will start. |
length | The length, in characters, of the entire input string. If the length is not specified, then it is assigned the length of the input string object. |
size_t RWTRegex< T >::subCount () const |
Returns the number of parenthesized subexpressions in this regular expression.
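As a rough analogy, std::regex exposes the same count through mark_count(). The sketch below assumes that this stand-in counts parenthesized subexpressions the way subCount() is described; the helper name countSubexpressions is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the number of parenthesized subexpressions in `pattern`,
// using std::regex::mark_count() as a stand-in for subCount().
inline std::size_t countSubexpressions(const std::string& pattern) {
    const std::regex re(pattern, std::regex::extended);
    return re.mark_count();
}
```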
Copyright © 2021 Rogue Wave Software, Inc., a Perforce company. All Rights Reserved. |