Basic Unicode Regular Expression Extensions

This section details the extensions to the POSIX ERE standard that support basic Unicode regular expressions in RWURegularExpression. Basic Unicode regular expression support corresponds to Level 1 Unicode regular expression support as described in Version 5.1 of UTR-18 (http://www.unicode.org/reports/tr18/tr18-5.1.html).

All regular expression pattern strings and search strings are treated as UTF-16 character sequences. UTF-16 is the only encoding supported through the pattern matching interface to RWURegularExpression. All pattern strings are accepted as RWUString objects, or are converted from a specified encoding to RWUString objects internally before being compiled. All search strings are taken as RWUString objects. Subexpression match strings are returned as RWUString objects.

Basic Unicode regular expressions do not recognize UTF-16 surrogate pairs (Unicode code points, or characters, represented as a sequence of two 16-bit code units). Each 16-bit code unit is treated as an individual character. Character properties are obtained from the Unicode character database. Characters are compared based on their bit patterns; no collation is performed. As such, basic Unicode regular expressions are useful for the majority of Unicode strings, and are more efficient than they would be if support for surrogates and collation were required. However, if support for surrogates or collation is required, then basic regular expression support may not meet these needs.
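The difference between code units and code points can be made concrete with a short sketch. It uses only the C++ standard library and does not call RWURegularExpression; the helper names (isHighSurrogate, isLowSurrogate, countCodePoints) are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <string>

// True if the 16-bit code unit is a high (lead) surrogate, U+D800..U+DBFF.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the 16-bit code unit is a low (trail) surrogate, U+DC00..U+DFFF.
constexpr bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points in a UTF-16 string: a well-formed surrogate pair
// counts once, every other code unit counts once.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            ++i;  // skip the trail unit of the pair
        }
        ++n;
    }
    return n;
}
```

For the string u"a\U0001F600", size() reports three code units while countCodePoints reports two code points. A basic Unicode regular expression works at the code unit level, so it would see three characters in that string.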

If support for canonical equivalence is required, normalize all strings before passing them to RWURegularExpression. For more information on normalization, see RWUNormalizer.
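As a brief illustration of why normalization matters, the sketch below compares two canonically equivalent spellings of the same character by code unit value only. It uses only the C++ standard library; the function names are hypothetical, and no RWUNormalizer or RWURegularExpression calls are shown.

```cpp
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE as a single precomposed code unit.
inline std::u16string precomposedEAcute() { return u"\u00E9"; }

// The canonically equivalent sequence: 'e' followed by
// U+0301 COMBINING ACUTE ACCENT.
inline std::u16string decomposedEAcute() { return u"e\u0301"; }

// Compares two UTF-16 strings by code unit values only, with no
// normalization and no collation.
inline bool sameCodeUnits(const std::u16string& a, const std::u16string& b) {
    return a == b;
}
```

The two strings render identically, yet sameCodeUnits returns false for them because one holds one code unit and the other holds two. Normalizing both the pattern and the search string to the same form removes this kind of mismatch.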

Basic Unicode regular expression syntax extensions

Hexadecimal notation

The \u syntax allows for the specification of 16-bit Unicode code units. For example, the range expression [\u0020-\u007f] matches any UTF-16 code unit with a numeric value from hexadecimal 20 through hexadecimal 7f.
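The range expression above amounts to an inclusive numeric test on each code unit. The following sketch expresses that test in plain C++ as an illustration of the semantics; it is not how RWURegularExpression is implemented, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Equivalent of the bracket set [\u0020-\u007f]: an inclusive range test
// on the numeric value of a single 16-bit code unit.
constexpr bool inRange0020To007F(char16_t u) {
    return u >= 0x0020 && u <= 0x007F;
}

// Returns the offset of the first code unit that falls in the range,
// or std::u16string::npos if there is none.
std::size_t findFirstInRange(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (inRange0020To007F(s[i])) return i;
    }
    return std::u16string::npos;
}
```

Both endpoints are included, so hexadecimal 20 (space) and hexadecimal 7f are in the set, while hexadecimal 1f and hexadecimal 80 are not.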

Character categories

Character categories must appear within a bracket set, and are denoted by the text {Category}, where Category is the name of a category to be matched. For example, [{L}{Zs}]* matches zero or more occurrences of any character that is either a letter (L) or a space separator (Zs).

The following two tables list all of the character category names supported by RWURegularExpression. Table 3 includes character categories based on UTR-18. Table 4 includes Rogue Wave-specific character category extensions.

An exception is thrown if any other text appears as a category name.

Table 3 – RWURegularExpression character categories based on UTR-18
Category	Description	Category	Description
L	All Letters	Pf	Final Quote Punctuation
Lu	Uppercase Letters	Po	Other Punctuation
Ll	Lowercase Letters	S	All Symbols
Lt	Titlecase Letters	Sm	Math Symbols
Lm	Modifier Letters	Sc	Currency Symbols
Lo	Other Letters	Sk	Modifier Symbols
M	All Marks	So	Other Symbols
Mn	Non-Spacing Marks	Z	All Separators
Mc	Spacing Combining Marks	Zs	Space Separators
Me	Enclosing Marks	Zl	Line Separator
N	All Numbers	Zp	Paragraph Separator
Nd	Number, Decimal Digit	C	“Other” Characters. Same as the union of Cc, Cf, Cs, Co, and Cn.
Nl	Number, Letter	Cc	Other, Control
No	Number, Other	Cf	Other, Format
P	All Punctuation Characters	Cs	Other, Surrogate
Pc	Connector Punctuation	Co	Other, Private Use
Pd	Dash Punctuation	ALL	Matches All Code Units
Ps	Open Punctuation	ASSIGNED¹	Matches All Assigned Code Units
Pe	Close Punctuation	UNASSIGNED	Matches All Unassigned Code Units (the opposite of ASSIGNED)
Pi	Initial Quote Punctuation

¹ A code point is “assigned” if it has a category other than RWUCharTraits::Unassigned. All code points assigned a category, as well as the blocks of code points allocated for private use, are "assigned."

The following table contains Rogue Wave-specific extensions to the set of character categories outlined in UTR-18.

Table 4 – Rogue Wave-specific extensions to character categories
Category	Description
WB¹	Matches Word Breaks. Matches a word boundary, much like the \b construct in Perl.
CB	Matches Character Breaks
LB	Matches Line Breaks
SB	Matches Sentence Breaks
BOL¹	Matches at the beginning of a line. Matches at the beginning of a string, or any of the following: \u2028, \u2029, \u000D\u000A, \u000A, \u000B, \u000C, \u000D, or \u0085.
EOL¹	Matches at the end of a line. Matches at the end of a string, or any of the following: \u2028, \u2029, \u000D\u000A, \u000A, \u000B, \u000C, \u000D, or \u0085.

¹ If this category appears in a bracket set, then that bracket set, or any enclosing subexpression without additional data, must not have + or * cardinality, or the pattern is flagged as an invalid pattern, and an exception of type InfiniteEmptyMatch is thrown.
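The line terminators listed for the BOL and EOL entries can be illustrated with a short sketch that computes where lines begin in a UTF-16 string. This is one plausible reading of the table, assuming that a line begins at offset 0 and immediately after each terminator, with \u000D\u000A treated as a single terminator. It uses only the C++ standard library, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True for the single-code-unit line terminators listed in Table 4.
constexpr bool isLineTerminator(char16_t u) {
    return u == 0x000A || u == 0x000B || u == 0x000C || u == 0x000D ||
           u == 0x0085 || u == 0x2028 || u == 0x2029;
}

// Returns the offsets at which a line begins: offset 0, plus the offset
// just past each terminator. A CR LF pair is consumed as one terminator.
std::vector<std::size_t> lineStarts(const std::u16string& s) {
    std::vector<std::size_t> starts{0};
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!isLineTerminator(s[i])) continue;
        if (s[i] == 0x000D && i + 1 < s.size() && s[i + 1] == 0x000A) {
            ++i;  // treat CR LF as a single terminator
        }
        starts.push_back(i + 1);
    }
    return starts;
}
```

Under these assumptions, a string that ends with a terminator also reports a final offset equal to its length, which corresponds to an empty last line. Whether RWURegularExpression reports a match at that position is not stated in this section.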

Subtraction

Subtraction allows a regular expression pattern to express the removal of a set of items from an existing bracket set. The syntax for such a construct is: [OriginalSet-[SubtractedSet]], where OriginalSet is a bracket set, and SubtractedSet is a bracket set of items to remove from the OriginalSet. For example, [{L}-[{Lu}]] matches all letters except for uppercase letters. Similarly, [{ASSIGNED}-[{C}]] matches all assigned Unicode characters, except for any characters that fall into the “Other” category.
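Subtraction is ordinary set difference: a code unit is in the result if it belongs to the original set and does not belong to the subtracted set. The sketch below models that with predicates in plain C++. The ASCII-only letter tests are stand-ins for the {L} and {Lu} categories, which really draw on the Unicode character database, and all names are hypothetical.

```cpp
#include <functional>

// A set of code units modeled as a membership predicate.
using CodeUnitSet = std::function<bool(char16_t)>;

// Models [OriginalSet-[SubtractedSet]]: in the original set and not in
// the subtracted set.
CodeUnitSet subtract(CodeUnitSet original, CodeUnitSet subtracted) {
    return [=](char16_t u) { return original(u) && !subtracted(u); };
}

// ASCII-only stand-in for {L} (assumption: real {L} covers all Unicode letters).
bool isAsciiLetter(char16_t u) {
    return (u >= u'a' && u <= u'z') || (u >= u'A' && u <= u'Z');
}

// ASCII-only stand-in for {Lu} (assumption: real {Lu} covers all uppercase letters).
bool isAsciiUpper(char16_t u) {
    return u >= u'A' && u <= u'Z';
}
```

With these stand-ins, subtract(isAsciiLetter, isAsciiUpper) accepts u'a' and rejects both u'A' and u'1', which mirrors the behavior described for [{L}-[{Lu}]].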

Simple word boundaries

This feature of basic (Level 1) Unicode regular expressions is available through the use of the WB category, described in Table 4.

Simple loose matches

UTR-18 describes only one type of loose match for basic Unicode regular expressions: caseless matching. Caseless matching is available in RWURegularExpression through the IgnoreCase option to the constructor.

Line breaks

Line breaks can be matched in RWURegularExpression with the {BOL} and {EOL} extended categories. The ^ and $ characters do not denote the beginning and end of lines, because that would conflict with POSIX, which requires these characters to anchor only at the beginning and end of an entire string.