POSIX Extended Regular Expression Syntax

Although UTR-18 Version 6 suggests use of a Perl-like pattern syntax, the regular expression support in the Internationalization Module uses the POSIX 2 extended regular expression (ERE) pattern syntax, with Unicode extensions, suggested by UTR-18 Version 5.1. That syntax is described in Table 2.

The special characters used by RWURegularExpression are as follows:

Table 2 – RWURegularExpression special characters based on POSIX 2 syntax
Character	Meaning
+	Matches one or more occurrences of the preceding item, except in a bracket expression. For example, a+ matches a, aa, aaa, and so on.
*	Matches zero or more occurrences of the preceding item, except in a bracket expression. For example, a* matches the empty string, a, aa, and so on.
?	Matches zero or one occurrence(s) of the preceding item, except in a bracket expression. For example, a? matches the empty string and a.
{ and }	Specify a cardinality range, formed as follows: {m,n}. This construct matches between m and n occurrences of the preceding item. For example, a{2,3} matches aa and aaa. This construct can also be formed using {m,} and {m}. The first matches m or more occurrences of the preceding item. For example, a{2,} matches aa, aaa, aaaa, and so on. The second matches exactly m occurrences of the preceding item. For example, a{2} matches aa. Note: { is treated differently in a bracket expression. In this context, { denotes the beginning of a Unicode character category, as described in Unicode Regular Expressions.
[ and ]	Create a bracket expression. A bracket expression defines a set of items, any one of which may be matched. For example, [abc] matches a, b, or c. Within a bracket expression, all regular expression special characters are treated as ordinary characters, with the following exceptions. - specifies a range of character values, based on their bit pattern. For example, [A-Za-z] matches all uppercase and lowercase English letters. To include - as a literal character, place it first or last in the set, as in [-a-z] or [A-Z-]. ^ is special only in the first position of the set, where it complements the set of items to be matched. For example, [^a-z] matches all characters except lowercase English letters. { denotes the beginning of a Unicode character category (see Unicode Regular Expressions). To include a literal {, escape it with the \ character, as in [\{]. ] closes the set; to include a literal ], place it first in the set, as in []abc] or [^]abc].
( and )	Group regular expression items into subexpressions, which are treated as a single unit. For example, whereas ab* matches a, ab, abb, and so on, (ab)* matches the empty string, ab, abab, and so on. ( and ) are not treated as special characters inside a bracket expression.
\	Escapes a regular expression special character, causing it to be treated as an ordinary character. For example, whereas (ab) indicates a subexpression consisting of ab, \(ab\) denotes the sequence of characters (, a, b, and ). Note: To specify the \ character in C++ source code, you must write \\, because the C++ compiler treats \ as the beginning of an escape sequence in string and character literals. In data files or text controls in dialog boxes, however, the double backslash is not necessary.
^	Indicates that a regular expression or subexpression is anchored at the beginning of the input string. For example, ^ab matches ab and abc, but not cab. Recall that ^ is treated differently in bracket expressions.
$	Indicates that a regular expression or subexpression is anchored at the end of the input string. For example, ab$ matches ab and cab, but not abc.
|	Denotes alternation: a set of equally valid alternative expressions or subexpressions, any one of which can be matched. For example, ab|cd matches ab or cd.
.	Matches any code unit, except for those which indicate the logical end of a line, as outlined in Unicode Technical Report #18: \u2028, \u2029, \u000A, \u000B, \u000C, \u000D, \u0085.
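
Several of the examples in Table 2 can be checked with a short program. The sketch below is only an approximation: it uses the C++ standard library's std::regex with the std::regex::extended (POSIX ERE) grammar as a stand-in, not RWURegularExpression, so it does not cover the Unicode extensions or the line-end behavior of the . character described above. The helper names matchesWhole, foundIn, and countTableMismatches are hypothetical and chosen for illustration.

```cpp
#include <regex>
#include <string>

// True if the whole of `text` matches the POSIX ERE `pattern`.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_match(text, re);
}

// True if `pattern` matches somewhere inside `text`.
bool foundIn(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::extended);
    return std::regex_search(text, re);
}

// Runs a handful of the Table 2 examples and returns how many of them
// disagree with the behavior the table describes.
int countTableMismatches() {
    struct Case {
        const char* pattern;
        const char* text;
        bool whole;      // true: whole-string match, false: search
        bool expected;
    };
    static const Case cases[] = {
        {"a+",     "aaa",  true,  true},   // one or more
        {"a+",     "",     true,  false},
        {"a*",     "",     true,  true},   // zero or more
        {"a?",     "a",    true,  true},   // zero or one
        {"a?",     "aa",   true,  false},
        {"a{2,3}", "aa",   true,  true},   // cardinality range
        {"a{2,3}", "aaaa", true,  false},
        {"[abc]",  "b",    true,  true},   // bracket expression
        {"[^a-z]", "q",    true,  false},  // complemented set
        {"[^a-z]", "Q",    true,  true},
        {"ab*",    "abb",  true,  true},   // * applies to b only
        {"(ab)*",  "abab", true,  true},   // * applies to the group
        {"(ab)*",  "",     true,  true},
        {"^ab",    "abc",  false, true},   // anchored at the beginning
        {"^ab",    "cab",  false, false},
        {"ab$",    "cab",  false, true},   // anchored at the end
        {"ab$",    "abc",  false, false},
        {"ab|cd",  "cd",   true,  true},   // alternation
    };
    int mismatches = 0;
    for (const Case& c : cases) {
        const bool actual = c.whole ? matchesWhole(c.pattern, c.text)
                                    : foundIn(c.pattern, c.text);
        if (actual != c.expected) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling countTableMismatches() from a test or from main should report no mismatches on a conforming standard library. The anchor rows use a search rather than a whole-string match, because that is where ^ and $ make a visible difference.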

NOTE: 	All of the above regular expression special characters are treated as special unless escaped. This differs slightly from the POSIX Extended Regular Expression standard, in which some characters are treated as special when escaped, while others are treated as special unless escaped.
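
The escaping rule, together with the C++ source code caveat from the \ row of Table 2, can be illustrated as follows. This is a minimal sketch that again assumes std::regex with the std::regex::extended grammar as a stand-in for RWURegularExpression; the function names are hypothetical. It shows the pattern \(ab\) written once with doubled backslashes and once as a C++11 raw string literal, which avoids the doubling.

```cpp
#include <regex>
#include <string>

// The pattern \(ab\) as an ordinary string literal: every backslash that
// should reach the regular expression engine is written as \\ in source.
std::string escapedParensLiteral() {
    return "\\(ab\\)";
}

// The same pattern as a raw string literal, where backslashes are not
// processed by the compiler. The delimiter "re" is arbitrary.
std::string escapedParensRaw() {
    return R"re(\(ab\))re";
}

// True if the whole of `text` matches the POSIX ERE `pattern`.
bool wholeMatchEre(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern, std::regex::extended));
}
```

Both functions return the same six-character pattern. Under the assumptions above, that pattern matches the literal text (ab) but not ab, whereas the unescaped pattern (ab) matches ab but not (ab).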