
RWCRExpr

Synopsis

#include <rw/re.h>
RWCRExpr re(".*\\.doc");  // Matches filename with suffix ".doc"

Description

Class RWCRExpr represents an extended regular expression such as those found in lex and awk. The constructor "compiles" the expression into a form that can be used more efficiently. The results can then be used for string searches using class RWCString. Regular expressions can be of arbitrary size, limited by memory. The extended regular expression features found here are a subset of those found in the POSIX.2 standard (ANSI/IEEE Std 1003.2, ISO/IEC 9945-2).

Note: RWCRExpr is available only if your compiler supports exception handling and the C++ Standard Library.

The regular expression (RE) is constructed as follows:

The following rules determine one-character REs that match a single character:

  1. Any character that is not a special character (defined in rule 3) matches itself.

  2. A backslash (\) followed by any special character matches the literal character itself; that is, the backslash "escapes" the special character.

  3. The "special characters" are:

    + * ? . [ ] ^ $ ( ) { } | \

  4. The period (.) matches any character. E.g., ".umpty" matches either "Humpty" or "Dumpty".

  5. A set of characters enclosed in brackets ([ ]) is a one-character RE that matches any of the characters in that set. E.g., "[akm]" matches either an "a", "k", or "m". A range of characters can be indicated with a dash. E.g., "[a-z]" matches any lower-case letter. However, if the first character of the set is the caret (^), then the RE matches any character except those in the set. It does not match the empty string. Example: [^akm] matches any character except "a", "k", or "m". The caret loses its special meaning if it is not the first character of the set.

The following rules can be used to build a multicharacter RE:

  6. Parentheses (( )) group parts of regular expressions together into subexpressions that can be treated as a single unit. For example, (ha)+ matches one or more occurrences of "ha".

  7. A one-character RE followed by an asterisk (*) matches zero or more occurrences of the RE. Hence, [a-z]* matches zero or more lower-case characters.

  8. A one-character RE followed by a plus (+) matches one or more occurrences of the RE. Hence, [a-z]+ matches one or more lower-case characters.

  9. A question mark (?) marks an optional element. The preceding RE can occur zero times or once in the string, but no more. E.g., xy?z matches either xyz or xz.

  10. The concatenation of REs is a RE that matches the corresponding concatenation of strings. E.g., [A-Z][a-z]* matches any capitalized word.

  11. The OR character ( | ) allows a choice between two regular expressions. For example, jell(y|ies) matches either "jelly" or "jellies".

  12. Braces ({ }) are reserved for future use.

All or part of the regular expression can be "anchored" to either the beginning or end of the string being searched:

  13. If the caret (^) is at the beginning of the (sub)expression, then the matched string must be at the beginning of the string being searched.

  14. If the dollar sign ($) is at the end of the (sub)expression, then the matched string must be at the end of the string being searched.

The most frequent problem when using this class is specifying a literal backslash in a pattern. Both the C++ compiler and the regular expression constructor treat a backslash as an escape for the character that follows it. Thus, to specify a regular expression that exactly matches "a\a", you need four backslashes in the source code: the regular expression constructor needs to see "a\\a", and for that to happen, the compiler has to see "a\\\\a":

     RWCRExpr reg("a\\\\a");
                    ^|^|
                     1 2

The backslashes marked with a ^ are an escape for the compiler, and the ones marked with | will thus be seen by the regular expression parser. At that point, the backslash marked 1 is an escape, and the one marked 2 will actually be put into the regular expression.

Similarly, if you really need to escape a character, such as a ".", you will have to pass two backslashes to the compiler:

    RWCRExpr regDot("\\.");
                     ^|

Once again, the backslash marked ^ is an escape for the compiler, and the one marked with | will be seen by the regular expression constructor as an escape for the following ".".

Persistence

None

Example

#include <rw/re.h>
#include <rw/cstring.h>
#include <rw/rstream.h>

int main() {
  RWCString aString("Hark! Hark! the lark");

  // A regular expression matching any lowercase word, or end of a
  // word, starting with "l":
  RWCRExpr re("l[a-z]*");

  cout << aString(re) << endl;  // Prints "lark"
  return 0;
}

Public Constructors

RWCRExpr(const char* pat);
RWCRExpr(const RWCString& pat);
RWCRExpr(const RWCRExpr& r);
RWCRExpr();

Public Destructor

~RWCRExpr();

Assignment Operators

RWCRExpr&
operator=(const RWCRExpr& r);
RWCRExpr&
operator=(const char* pat);
RWCRExpr&
operator=(const RWCString& pat);

Public Member Functions

size_t
index(const RWCString& str, size_t* len = NULL, 
      size_t start=0) const;
statusType
status() const;
statusType                         Meaning
RWCRExpr::OK                       No errors
RWCRExpr::NOT_SUPPORTED            POSIX.2 feature not yet supported
RWCRExpr::NO_MATCH                 Tried to find a match but failed
RWCRExpr::BAD_PATTERN              Pattern was illegal
RWCRExpr::BAD_COLLATING_ELEMENT    Invalid collating element referenced
RWCRExpr::BAD_CHAR_CLASS_TYPE      Invalid character class type referenced
RWCRExpr::TRAILING_BACKSLASH       Trailing \ in pattern
RWCRExpr::UNMATCHED_BRACKET        [] imbalance
RWCRExpr::UNMATCHED_PARENTHESIS    () imbalance
RWCRExpr::UNMATCHED_BRACE          {} imbalance
RWCRExpr::BAD_BRACE                Content of {} invalid
RWCRExpr::BAD_CHAR_RANGE           Invalid endpoint in [a-z] expression
RWCRExpr::OUT_OF_MEMORY            Out of memory
RWCRExpr::BAD_REPEAT               ?, * or + not preceded by valid regular expression


©Copyright 1999, Rogue Wave Software, Inc.