SourcePro® API Reference Guide

 

Module Description

Classes in this group perform string processing operations. Use class RWCString for single-byte and multibyte strings, and class RWWString for wide character strings. These classes make it easy to do concatenation, comparison, indexing (with optional bounds checking), I/O, case changes, stripping, and many other operations.

For Unicode character manipulation, please see the Internationalization Classes.

Classes

class  RWBasicUString
 Represents and manages an array of UTF-16 values. More...
 
class  RWCConstSubString
 Allows some subsection of an RWCString to be addressed by defining a starting position and an extent. More...
 
class  RWCopyOnWriteCConstSubString
 Alternate implementation of RWCConstSubString when RW_COPY_ON_WRITE_STRING is defined. More...
 
class  RWCopyOnWriteCString
 Alternate implementation of RWCString when RW_COPY_ON_WRITE_STRING is defined. More...
 
class  RWCopyOnWriteCSubString
 Alternate implementation of RWCSubString when RW_COPY_ON_WRITE_STRING is defined. More...
 
class  RWCopyOnWriteWConstSubString
 Alternate implementation of RWWConstSubString when RW_COPY_ON_WRITE_STRING is defined. More...
 
class  RWCopyOnWriteWString
 Alternate implementation of RWWString when RW_COPY_ON_WRITE_STRING is defined. More...
 
class  RWCopyOnWriteWSubString
 Alternate implementation of RWWSubString when RW_COPY_ON_WRITE_STRING is defined. More...
 
class  RWCRegexp
 Deprecated. Represents a regular expression. More...
 
class  RWCString
 Offers powerful and convenient facilities for manipulating strings. More...
 
class  RWCSubString
 Allows some subsection of an RWCString to be addressed by defining a starting position and an extent. More...
 
class  RWCTokenizer
 Breaks a string into separate tokens, delimited by arbitrary whitespace. Can be used as an alternative to the C++ Standard Library function std::strtok(). More...
 
class  RWRegexErr
 Exception class that reports errors from within RWTRegex. More...
 
class  RWStandardCConstSubString
 Alternate implementation of RWCConstSubString when RW_COPY_ON_WRITE_STRING is not defined. More...
 
class  RWStandardCString
 Alternate implementation of RWCString when RW_COPY_ON_WRITE_STRING is not defined. More...
 
class  RWStandardCSubString
 Alternate implementation of RWCSubString when RW_COPY_ON_WRITE_STRING is not defined. More...
 
class  RWStandardWConstSubString
 Alternate implementation of RWWConstSubString when RW_COPY_ON_WRITE_STRING is not defined. More...
 
class  RWStandardWString
 Alternate implementation of RWWString when RW_COPY_ON_WRITE_STRING is not defined. More...
 
class  RWStandardWSubString
 Alternate implementation of RWWSubString when RW_COPY_ON_WRITE_STRING is not defined. More...
 
class  RWTRegex< T >
 Supports regular expression matching based on the POSIX.2 standard and supports both narrow and wide characters. More...
 
class  RWTRegexMatchIterator< T >
 Iterates over matches found using RWTRegex. More...
 
class  RWTRegexResult< T >
 Encapsulates the results from a search using RWTRegex. More...
 
class  RWTRegexTraits< T >
 Defines static, inline methods for returning specific regular expression character values. More...
 
class  RWTRegularExpression< charT >
 Deprecated. Provides extended regular expression matching similar to that found in lex and awk. More...
 
class  RWWConstSubString
 Allows some subsection of an RWWString to be addressed by defining a starting position and an extent. More...
 
class  RWWString
 Offers powerful and convenient facilities for manipulating wide character strings. More...
 
class  RWWSubString
 Allows some subsection of an RWWString to be addressed by defining a starting position and an extent. More...
 
class  RWWTokenizer
 Breaks up a string into separate tokens, delimited by arbitrary whitespace. Can be used as an alternative to the C++ Standard Library function std::wcstok(). More...
 

Typedefs

typedef RWTRegularExpression< char > RWCRExpr
 Deprecated. This class is a typedef for RWTRegularExpression<char>. More...
 
 

Typedef Documentation

typedef RWTRegularExpression<char> RWCRExpr
Deprecated:
As of SourcePro 4, use RWTRegex<char> instead.

Class RWCRExpr represents an extended regular expression such as those found in lex and awk. The constructor "compiles" the expression into a form that can be used more efficiently. The results can then be used for string searches using class RWCString. Regular expressions can be of arbitrary size, limited by memory. The extended regular expression features found here are a subset of those found in the POSIX.2 standard (ANSI/IEEE Std 1003.2, ISO/IEC 9945-2).

The regular expression (RE) is constructed as follows:

The following rules determine one-character REs that match a single character:

Any character that is not a special character (listed in rule 2 below) matches itself.

  1. A backslash (\) followed by any special character matches the literal character itself; that is, its use "escapes" the special character. For example, \* matches "*" without applying the syntax of the * special character.
  2. The "special characters" are:
    + * ? . [ ] ^ $ ( ) { } | \
  3. The period (.) matches any character. For example, ".umpty" matches either "Humpty" or "Dumpty".
  4. A set of characters enclosed in brackets ([ ]) is a one-character RE that matches any of the characters in that set. This means that [akm] matches either an "a", "k", or "m". A range of characters can be indicated with a dash, as in [a-z], which matches any lower-case letter. However, if the first character of the set is the caret (^), then the RE matches any character except those in the set. It does not match the empty string. For example: [^akm] matches any character except "a", "k", or "m". The caret loses its special meaning if it is not the first character of the set.
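The one-character rules above can be tried out with a minimal sketch. RWCRExpr itself is not used here: the sketch assumes that std::regex with the std::regex::extended (POSIX extended) grammar behaves the same way for these simple patterns, and the helper name matchesWhole is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true when the whole of `text` matches the
// POSIX extended regular expression `pattern`.
// Assumption: std::regex stands in for RWCRExpr in this sketch.
bool matchesWhole(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}
```

With this helper, matchesWhole(".umpty", "Humpty") and matchesWhole("[^akm]", "b") are true, while matchesWhole("[^akm]", "a") and matchesWhole("[^akm]", "") are false. The escaped pattern is written "\\*" in source code so that the parser receives \* and matches a literal "*".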

The following rules can be used to build a multicharacter RE:

  1. Parentheses (( )) group parts of regular expressions together into subexpressions that can be treated as a single unit. For example, (ha)+ matches one or more occurrences of "ha".
  2. A one-character RE followed by an asterisk (*) matches zero or more occurrences of the RE. Hence, [a-z]* matches zero or more lower-case characters.
  3. A one-character RE followed by a plus (+) matches one or more occurrences of the RE. Hence, [a-z]+ matches one or more lower-case characters.
  4. A question mark (?) marks the preceding RE as optional: it can occur zero times or once in the string, no more. For example, xy?z matches either "xyz" or "xz".
  5. The concatenation of REs is a RE that matches the corresponding concatenation of strings. For example, [A-Z][a-z]* matches an upper-case letter followed by zero or more lower-case letters.
  6. The OR character ( | ) allows a choice between two regular expressions. For example, jell(y|ies) matches either "jelly" or "jellies".
  7. Braces ({ }) are reserved for future use.
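The multicharacter rules above can be illustrated by extracting the first match from a string. This is a sketch only: it assumes std::regex with the std::regex::extended grammar as a stand-in for RWCRExpr, and the helper name firstMatch is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` that matches
// the POSIX extended regular expression `pattern`, or an empty string
// when nothing matches.
// Assumption: std::regex stands in for RWCRExpr in this sketch.
std::string firstMatch(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::extended);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

For example, firstMatch("jell(y|ies)", "two jellies") yields "jellies", firstMatch("[A-Z][a-z]*", "say Hello there") yields "Hello", and firstMatch("[a-z]+", "12345") yields an empty string because at least one lower-case letter is required.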

All or part of the regular expression can be "anchored" to either the beginning or end of the string being searched:

  1. If the caret (^) is at the beginning of the (sub)expression, then the matched string must be at the beginning of the string being searched.
  2. If the dollar sign ($) is at the end of the (sub)expression, then the matched string must be at the end of the string being searched.
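The effect of the two anchors can be seen in a short sketch. It assumes std::regex with the std::regex::extended grammar in place of RWCRExpr, and the helper name foundIn is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true when `pattern` matches somewhere in
// `text`. An anchor in the pattern restricts where that match may occur.
// Assumption: std::regex stands in for RWCRExpr in this sketch.
bool foundIn(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::extended);
    return std::regex_search(text, re);
}
```

Given the string "Hark! the lark", the unanchored pattern "lark" is found, "^Hark" and "lark$" are found, and "^lark" and "Hark$" are not, because the matched text does not sit at the required end of the string.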

The most frequent problem when using this class is specifying a backslash character in the expression. Both the C++ compiler and the regular expression constructor treat a backslash as an escape for the character that follows it. Thus, to specify a regular expression that exactly matches "a\a", you have to write four backslashes in the source code: the regular expression parser needs to see "a\\a", and for that to happen, the compiler has to see "a\\\\a".

RWCRExpr reg("a\\\\a");
               ^|^|
                1 2

The backslashes marked with a ^ are an escape for the compiler, and the ones marked with | will thus be seen by the regular expression parser. At that point, the backslash marked 1 is an escape, and the one marked 2 will actually be put into the regular expression.

Similarly, if you really need to escape a character, such as a '.', you will have to pass two backslashes to the compiler:

RWCRExpr regDot("\\.");
                 ^|

Once again, the backslash marked ^ is an escape for the compiler, and the one marked with | will be seen by the regular expression constructor as an escape for the following '.'.
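The two levels of escaping can be checked directly by looking at the string the compiler produces before any regular expression parser sees it. The sketch below assumes std::regex with the std::regex::extended grammar in place of RWCRExpr; the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// The pattern text as the regular expression parser receives it.
// Four backslashes in the source become two in the compiled string.
std::string backslashPattern()
{
    return "a\\\\a";
}

// Hypothetical helper: true when the whole of `text` matches `pattern`.
// Assumption: std::regex stands in for RWCRExpr in this sketch.
bool fullMatch(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}
```

Here backslashPattern() holds four characters (a, \, \, a), the same text as the raw string literal R"(a\\a)". Passing it to fullMatch() with the three-character text "a\\a" (that is, a\a) gives true. Likewise, fullMatch("\\.", ".") is true and fullMatch("\\.", "x") is false. Where raw string literals are available, they remove the compiler-level escaping and leave only the escaping meant for the regular expression parser.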

Synopsis

#include <rw/re.h>
RWCRExpr re(".*\\.doc$"); // Matches filename with suffix ".doc"

Persistence

None

Example

#include <iostream>
#include <rw/re.h>
#include <rw/cstring.h>

int main()
{
    RWCString s("Hark! Hark! the lark");

    std::cout << "Searching for an expression beginning with \"l\" in \""
              << s << "\".\n";

    // A regular expression matching any lower-case word
    // starting with 'l':
    RWCRExpr reg("l[a-z]*");

    // Prints 'lark'
    std::cout << "Found \"" << s.match(reg) << "\"." << std::endl;

    return 0;
}

Copyright © 2022 Rogue Wave Software, Inc., a Perforce company. All Rights Reserved.