Unicode regular expressions

This section describes the extensions to the POSIX ERE standard that the RWURegularExpression syntax provides to support basic and tailored Unicode regular expressions.

Basic Unicode regular expression extensions

This section details the extensions to the POSIX ERE standard that support basic Unicode regular expressions in RWURegularExpression. Basic Unicode regular expression support corresponds to Level 1 Unicode regular expression support as described in Version 5.1 of UTR-18 (http://www.unicode.org/reports/tr18/tr18-5.1.html).

All regular expression pattern strings and search strings are treated as UTF-16 character sequences. UTF-16 is the only encoding supported through the pattern matching interface to RWURegularExpression. All pattern strings are accepted as RWUString objects, or are converted from a specified encoding to RWUString objects internally before being compiled. All search strings are taken as RWUString objects. Subexpression match strings are returned as RWUString objects.

Basic Unicode regular expressions do not recognize UTF-16 surrogate pairs (Unicode code points, or characters, represented as a sequence of two 16-bit code units). Each 16-bit code unit is treated as an individual character. Character properties are obtained from the Unicode character database. Characters are compared based on their bit patterns; no collation is performed. As such, basic Unicode regular expressions are useful for the majority of Unicode strings, and are more efficient than they would be if support for surrogates and collation were required. However, if support for surrogates or collation is required, then basic regular expression support may not meet these needs.
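The difference between code units and code points can be illustrated with the C++ standard library alone. The following is a minimal sketch, not part of RWURegularExpression: the helper names (isHighSurrogate, isLowSurrogate, countCodePoints, codeUnitExample) are hypothetical and std::u16string stands in for a UTF-16 string. It shows that a string containing one supplementary character holds one more code unit than it has code points, and the extra unit is what basic regular expressions see as a separate character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the code unit is a high (leading) surrogate, U+D800..U+DBFF.
inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the code unit is a low (trailing) surrogate, U+DC00..U+DFFF.
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points in a UTF-16 string. A well-formed surrogate pair counts
// once; an unpaired surrogate is counted as one item on its own.
inline std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++count;
    }
    return count;
}

// Example: "a" followed by U+10000, which UTF-16 encodes as D800 DC00.
inline void codeUnitExample() {
    const std::u16string text = u"a\U00010000";
    assert(text.size() == 3);            // three 16-bit code units
    assert(countCodePoints(text) == 2);  // two code points
}
```

Under basic support, a construct such as "." is applied to each of the three code units; the code point count corresponds to what tailored support works with.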

If support for canonical equivalence is required, normalize all strings before passing them to RWURegularExpression. For more information on normalization, see RWUNormalizer.

Basic Unicode regular expression syntax extensions

  • Hexadecimal notation

    The \u syntax allows for the specification of 16-bit Unicode code units. For example, the range expression [\u0020-\u007f] matches any UTF-16 code unit with a numeric value from hexadecimal 20 through hexadecimal 7f.

  • Character categories

    Character categories must appear within a bracket set, and are denoted by the text {Category}, where Category is the name of a category to be matched. For example, [{L}{Zs}]* matches zero or more occurrences of any character that is either a letter (L) or a space separator (Zs).

    The following two tables list all of the character category names supported by RWURegularExpression. The first table, RWURegularExpression character categories based on UTR-18, lists the categories defined by UTR-18. The second table, Rogue Wave-specific extensions to character categories, lists the Rogue Wave-specific extensions.

    An exception is thrown if any other text appears as a category name.

    RWURegularExpression character categories based on UTR-18

    Category  Description                 Category    Description
    --------  --------------------------  ----------  -----------
    L         All Letters                 Pf          Final Quote Punctuation
    Lu        Uppercase Letters           Po          Other Punctuation
    Ll        Lowercase Letters           S           All Symbols
    Lt        Titlecase Letters           Sm          Math Symbols
    Lm        Modifier Letters            Sc          Currency Symbols
    Lo        Other Letters               Sk          Modifier Symbols
    M         All Marks                   So          Other Symbols
    Mn        Non-Spacing Marks           Z           All Separators
    Mc        Spacing Combining Marks     Zs          Space Separators
    Me        Enclosing Marks             Zl          Line Separator
    N         All Numbers                 Zp          Paragraph Separator
    Nd        Number, Decimal Digit       C           “Other” Characters. Same as the union of Cc, Cf, Cs, Co, and Cn.
    Nl        Number, Letter              Cc          Other, Control
    No        Number, Other               Cf          Other, Format
    P         All Punctuation Characters  Cs          Other, Surrogate
    Pc        Connector Punctuation       Co          Other, Private Use
    Pd        Dash Punctuation            ALL         Matches All Code Units
    Ps        Open Punctuation            ASSIGNED    Matches All Assigned Code Units
    Pe        Close Punctuation           UNASSIGNED  Matches All Unassigned Code Units (the opposite of ASSIGNED)
    Pi        Initial Quote Punctuation

    The following table contains Rogue Wave-specific extensions to the set of character categories outlined in UTR-18.

    Rogue Wave-specific extensions to character categories

    Category  Description
    --------  -----------
    WB        Matches Word Breaks. Matches a word boundary, much like the \b construct in Perl.
    CB        Matches Character Breaks
    LB        Matches Line Breaks
    SB        Matches Sentence Breaks
    BOL       Matches at the beginning of a line. Matches at the beginning of a string, or any of the following: \u2028, \u2029, \u000D\u000A, \u000A, \u000B, \u000C, \u000D, or \u0085.
    EOL       Matches at the end of a line. Matches at the end of a string, or any of the following: \u2028, \u2029, \u000D\u000A, \u000A, \u000B, \u000C, \u000D, or \u0085.

  • Subtraction

    Subtraction allows a regular expression pattern to express the removal of a set of items from an existing bracket set. The syntax for such a construct is: [OriginalSet-[SubtractedSet]], where OriginalSet is a bracket set, and SubtractedSet is a bracket set of items to remove from the OriginalSet. For example, [{L}-[{Lu}]] matches all letters except for uppercase letters. Similarly, [{ASSIGNED}-[{C}]] matches all assigned Unicode characters, except for any characters that fall into the “Other” category.

  • Simple word boundaries

    This feature of basic (Level 1) Unicode regular expressions is available through the use of the WB category, described in the prior table, Rogue Wave-specific extensions to character categories.

  • Simple loose matches

    The only type of loose match described in UTR-18 for basic Unicode regular expressions is the caseless match. Caseless matching is available in RWURegularExpression through the IgnoreCase option to the constructor.

  • Line breaks

    Line breaks can be matched using RWURegularExpression through the use of the {BOL} and {EOL} extended categories. ^ and $ are not used to denote the beginning and ending of lines, as this conflicts with the POSIX requirements for these characters. POSIX requires that these characters anchor only at the beginning and ending of an entire string.
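The set of line terminators listed for {BOL} and {EOL} can be made concrete with a short standard C++ sketch. This is an illustration only: it assumes that a line begins at the start of the string and immediately after each listed terminator, with \u000D\u000A treated as a single terminator. That is one reading of the description above, not RWURegularExpression's implementation, and the names isLineTerminator, lineStarts, and lineStartsExample are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True for the single code units listed as line terminators.
inline bool isLineTerminator(char16_t u) {
    return u == 0x000A || u == 0x000B || u == 0x000C || u == 0x000D ||
           u == 0x0085 || u == 0x2028 || u == 0x2029;
}

// Returns the index of every position that begins a line: index 0, plus the
// position just after each terminator. CR LF counts as one terminator.
inline std::vector<std::size_t> lineStarts(const std::u16string& s) {
    std::vector<std::size_t> starts;
    starts.push_back(0);
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!isLineTerminator(s[i])) continue;
        // Consume the LF of a CR LF pair so the pair yields one line start.
        if (s[i] == 0x000D && i + 1 < s.size() && s[i + 1] == 0x000A) ++i;
        starts.push_back(i + 1);
    }
    return starts;
}

// Example: three lines separated by CR LF and by U+2028.
inline void lineStartsExample() {
    const std::vector<std::size_t> expected = {0, 4, 7};
    assert(lineStarts(u"ab\r\ncd\u2028e") == expected);
}
```

Under the same assumption, the end-of-line positions are the index of each terminator plus the end of the string.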

Tailored Unicode regular expression extensions

Tailored regular expression support adds Level 2 and Level 3 regular expression support as described in UTR-18 Version 5.1 (http://www.unicode.org/reports/tr18/tr18-5.1.html).

Tailored regular expression support extends basic regular expression support in the following ways.

Tailored Unicode regular expression syntax extensions

  • Treating surrogate pairs as characters

    Tailored support recognizes surrogate pairs during pattern compilation and during pattern matching. For example, consider the pattern, \uD800\uDC00*. With basic regular expressions, the pattern compiler does not recognize \uD800\uDC00 as a surrogate pair, and interprets the pattern as \uD800 followed by zero or more occurrences of \uDC00. However, with tailored support, \uD800\uDC00 is recognized as a single code point, and the pattern is interpreted as zero or more occurrences of the code point, \uD800\uDC00. During matching, full code points are extracted for testing against “.”, categories, bracket sets, and all other constructs. Further, during search operations, only code point boundaries are considered as potential match starting positions.

  • The use of the script property

    Tailored regular expressions allow for testing a code point for a script property. The script property uses a syntax similar to that of general categories. The syntax is as follows:

    [{Script}]

    As with categories, a script specification must appear in a bracket set, and must be surrounded by curly braces. Within the curly braces is the name of a script, which is case-sensitive. The following table lists scripts that are supported by tailored regular expressions.

    Script properties supported by tailored regular expressions

    Property    Property
    ----------  ----------
    Common      Inherited
    Arabic      Armenian
    Bengali     Bopomofo
    Cherokee    Coptic
    Cyrillic    Deseret
    Devanagari  Ethiopic
    Georgian    Gothic
    Greek       Gujarati
    Gurmukhi    Han
    Hangul      Hebrew
    Hiragana    Kannada
    Katakana    Khmer
    Lao         Latin
    Malayalam   Mongolian
    Myanmar     Ogham
    OldItalic   Oriya
    Runic       Sinhala
    Syriac      Tamil
    Telugu      Thaana
    Thai        Tibetan
    Ucas        Yi

    For example, the following pattern matches one or more occurrences of a character in the Thai script: [{Thai}]+

  • The ability to specify code points using \v syntax

    The \v syntax is given as \vXXXXXX, where each X is a valid hexadecimal digit. The \v must be followed by exactly six valid hexadecimal digits. For example, the surrogate pair \uD800\uDC00 could be specified as \v010000. \v escape sequences can appear anywhere in a pattern, including bracket expressions. Recall that, as with any escape sequence, the backslash in \v must be escaped when the pattern is written in C++ source code: \\v010000. The first backslash is consumed by the C++ compiler.

  • Matching canonical equivalents

    Tailored regular expressions match canonical equivalents. For example, the pattern a\u0308 matches both a\u0308 and ä.

  • Specifying grapheme clusters

    Tailored regular expressions allow for the specification of grapheme clusters using the \g syntax. The syntax for grapheme clusters is \g{grapheme}, where \g starts the grapheme cluster specification. The { and } must surround the grapheme cluster. Within the curly braces, the grapheme is specified. Grapheme clusters can appear anywhere in the pattern, including bracket sets. For example, the pattern ab\g{ch}d matches the string abchd. With the traditional Spanish locale, the pattern [\g{ch}-d] matches ch and d, but does not match c or e. Recall that, as with any escape sequence, the backslash in \g must be escaped when the pattern is written in C++ source code: \\g{ch}. The first backslash is consumed by the C++ compiler.

  • Performing all comparisons using collation

    With tailored regular expressions, all comparisons are performed using Unicode collation. The type of collation can be specified using the setCollationStrength() method, and queried using the getCollationStrength() method. These methods may be used only with tailored regular expressions, and throw an unsupported error exception with basic regular expressions.

    The collation support in RWURegularExpression is coarse-grained, meaning that it applies to the entire pattern. At this time, no fine-grained collation is supported.

    If no collation strength is specified, then the default collation strength for the specified locale is used. For many locales, the default strength is Tertiary. For example, in the en locale, the pattern résumé uses tertiary collation strength by default. At this default level, the string résumé matches, but resume and Résumé do not. If the collation strength for the pattern is changed to Primary, then resume, résumé, and Résumé all match.
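The relationship between the \v form and the equivalent \u surrogate pair follows from the standard UTF-16 encoding rule, and can be sketched in standard C++. The helper names toVEscape, toUEscape, and escapeExample are hypothetical and not part of RWURegularExpression; the sketch only builds the escape text that would appear in a pattern.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a code point in the six-digit \v form, e.g. 0x10000 -> \v010000.
inline std::string toVEscape(std::uint32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\v%06X", static_cast<unsigned>(cp));
    return buf;
}

// Formats the same code point with \u escapes: one escape for values below
// 0x10000, otherwise the UTF-16 surrogate pair (high unit, then low unit).
inline std::string toUEscape(std::uint32_t cp) {
    char buf[16];
    if (cp < 0x10000) {
        std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;        // 20-bit offset
        const unsigned hi = 0xD800 + (v >> 10);      // top 10 bits
        const unsigned lo = 0xDC00 + (v & 0x3FF);    // bottom 10 bits
        std::snprintf(buf, sizeof buf, "\\u%04X\\u%04X", hi, lo);
    }
    return buf;
}

// Example: the code point used in the text above.
inline void escapeExample() {
    assert(toVEscape(0x10000) == "\\v010000");
    assert(toUEscape(0x10000) == "\\uD800\\uDC00");
}
```

The \v form names the code point directly, so the pattern does not depend on the reader working out the surrogate arithmetic.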

Tailored regular expressions, by default, do not recognize graphemes (other than those specified with \g) during pattern compilation, or when matching the “.” (or any other element). As such, the pattern a\u0308+ matches an a followed by one or more occurrences of \u0308. Similarly, "." matches only the "a" in a\u0308.

As an alternative, the InterpretGraphemes option can be used with tailored regular expressions. If this option is given as a constructor argument for a tailored regular expression, then the pattern a\u0308+ is interpreted as one or more occurrences of a\u0308, or ä, or any other equivalent. Similarly, "." matches all of a\u0308.

The InterpretGraphemes option is ignored for basic regular expressions.

How to use tailored regular expressions

To allow RWURegularExpression to use the tailored regular expression features, you may pass RWURegularExpression::Tailored as the second argument of the constructor as follows:

RWURegularExpression re(SomeRWUString,RWURegularExpression::Tailored);

or you may construct first, then set the level:

re.setLevel(RWURegularExpression::Tailored);

For more information on creating a regular expression, see How to Create an RWURegularExpression.
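When a pattern is written as a C++ string literal, each backslash intended for the regular expression must survive the compiler. The sketch below uses only the standard library and does not construct an RWURegularExpression; the function name escapingExample is hypothetical. It compares the doubled-backslash form with a C++11 raw string literal, an alternative this document does not itself mention. How the resulting text is converted to an RWUString is not shown.

```cpp
#include <cassert>
#include <string>

inline void escapingExample() {
    // Ordinary literal: every backslash meant for the pattern is doubled.
    const std::string escaped = "ab\\g{ch}d";

    // Raw literal (C++11): backslashes are written once, as in the pattern.
    const std::string raw = R"(ab\g{ch}d)";

    // Both literals hold the same nine characters: ab\g{ch}d
    assert(escaped == raw);
    assert(escaped.size() == 9);

    // The same applies to \v: the pattern text is the 8 characters \v010000.
    assert(std::string("\\v010000") == R"(\v010000)");
    assert(std::string("\\v010000").size() == 8);
}
```

A raw literal keeps the source text identical to the pattern as documented, which can make longer patterns easier to review.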