Tailored Unicode Regular Expression Extensions


Tailored regular expression support adds the Level 2 and Level 3 regular expression features described in Unicode Technical Report #18 (UTR #18), Version 5.1 (http://www.unicode.org/reports/tr18/tr18-5.1.html).

Tailored regular expression support extends basic regular expression support in the following ways.

Tailored Unicode regular expression syntax extensions

Treating surrogate pairs as characters

Tailored support recognizes surrogate pairs during both pattern compilation and pattern matching. For example, consider the pattern \uD800\uDC00*. With basic regular expressions, the pattern compiler does not recognize \uD800\uDC00 as a surrogate pair and interprets the pattern as \uD800 followed by zero or more occurrences of \uDC00. With tailored support, \uD800\uDC00 is recognized as a single code point, and the pattern is interpreted as zero or more occurrences of that code point. During matching, full code points are extracted for testing against “.”, categories, bracket sets, and all other constructs. Further, during search operations, only code point boundaries are considered as potential match starting positions.
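The relationship between a surrogate pair and the code point it encodes can be sketched in standard C++. This is only an illustration of the UTF-16 arithmetic, not the library's implementation; the helper names isHighSurrogate, isLowSurrogate, and combineSurrogates are hypothetical and are not part of SourcePro.

```cpp
#include <cassert>

// Hypothetical helper: true if `unit` is a high (leading) surrogate,
// that is, in the range U+D800..U+DBFF.
constexpr bool isHighSurrogate(char16_t unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// Hypothetical helper: true if `unit` is a low (trailing) surrogate,
// that is, in the range U+DC00..U+DFFF.
constexpr bool isLowSurrogate(char16_t unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Hypothetical helper: combines a surrogate pair into the single
// supplementary code point (U+10000..U+10FFFF) that it encodes.
inline char32_t combineSurrogates(char16_t high, char16_t low) {
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

Under this arithmetic, the pair \uD800\uDC00 from the example above corresponds to the code point U+10000, which is the unit that the * quantifier applies to under tailored support.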

The use of the script property

Tailored regular expressions allow for testing a code point for a script property. The script property uses a syntax similar to that of general categories. The syntax is as follows:

[{Script}]

As with categories, a script specification must appear in a bracket set, and must be surrounded by curly braces. Within the curly braces is the name of a script, which is case-sensitive. The following table lists scripts that are supported by tailored regular expressions.

Table 5 – Script properties supported by tailored regular expressions
Script	Script
Common	Inherited
Arabic	Armenian
Bengali	Bopomofo
Cherokee	Coptic
Cyrillic	Deseret
Devanagari	Ethiopic
Georgian	Gothic
Greek	Gujarati
Gurmukhi	Han
Hangul	Hebrew
Hiragana	Kannada
Katakana	Khmer
Lao	Latin
Malayalam	Mongolian
Myanmar	Ogham
OldItalic	Oriya
Runic	Sinhala
Syriac	Tamil
Telugu	Thaana
Thai	Tibetan
Ucas	Yi

For example, the following pattern matches one or more occurrences of a character in the Thai script: [{Thai}]+

The ability to specify code points using \v syntax

The \v syntax has the form \vXXXXXX, where each X is a hexadecimal digit. The \v must be followed by exactly six hexadecimal digits. For example, the code point encoded by the surrogate pair \uD800\uDC00 can be specified as \v010000. \v escape sequences can appear anywhere in a pattern, including bracket expressions. Recall that, as with any escape sequence, the backslash in \v must itself be escaped when the pattern is written in C++ source code: \\v010000. The first backslash is for the C++ compiler.
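As a minimal sketch of the rules above, the following standard C++ function reads a \vXXXXXX escape from a pattern string and returns the code point value. The function parseVEscape is a hypothetical helper written for illustration; it is not part of the SourcePro API and says nothing about how the library parses patterns internally.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: parses a "\vXXXXXX" escape starting at `pos`.
// Returns the code point value, or -1 if the text at `pos` is not a
// backslash, a 'v', and exactly six hexadecimal digits.
inline long parseVEscape(const std::string& pattern, std::size_t pos = 0) {
    if (pos + 8 > pattern.size()) return -1;          // too short for \v + 6 digits
    if (pattern[pos] != '\\' || pattern[pos + 1] != 'v') return -1;

    long value = 0;
    for (std::size_t i = pos + 2; i < pos + 8; ++i) {
        const unsigned char c = static_cast<unsigned char>(pattern[i]);
        if (!std::isxdigit(c)) return -1;             // not a hexadecimal digit
        const int digit = std::isdigit(c) ? c - '0'
                                          : std::tolower(c) - 'a' + 10;
        value = value * 16 + digit;
    }
    return value;
}
```

Note that the C++ literal "\\v010000" contains eight characters at run time: one backslash, the letter v, and six digits. Passing that literal to the sketch yields 0x10000.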

Matching canonical equivalents

Tailored regular expressions match canonical equivalents. For example, the pattern a\u0308 matches both a\u0308 and ä.
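To see why canonical-equivalent matching matters, consider how the two forms look as UTF-16 strings in standard C++. The sketch below only shows that the decomposed and precomposed forms are different code unit sequences, so a plain code unit comparison treats them as unequal. The names kDecomposed, kPrecomposed, and demoCanonicalForms are hypothetical and chosen for this illustration.

```cpp
#include <cassert>
#include <string>

// "a" followed by U+0308 COMBINING DIAERESIS (decomposed form).
const std::u16string kDecomposed = u"a\u0308";

// U+00E4 LATIN SMALL LETTER A WITH DIAERESIS (precomposed form).
const std::u16string kPrecomposed = u"\u00E4";

// Demonstrates that the two canonically equivalent forms differ
// when compared code unit by code unit.
inline void demoCanonicalForms() {
    assert(kDecomposed.size() == 2);    // two code units: 'a', U+0308
    assert(kPrecomposed.size() == 1);   // one code unit: U+00E4
    assert(kDecomposed != kPrecomposed);
}
```

Both strings display as ä, and a tailored regular expression treats them as equivalent even though the underlying sequences differ.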

Specifying grapheme clusters

Tailored regular expressions allow for the specification of grapheme clusters using the \g syntax. The syntax is \g{grapheme}, where \g starts the grapheme cluster specification and the grapheme itself is enclosed in curly braces. Grapheme clusters can appear anywhere in the pattern, including bracket sets. For example, the pattern ab\g{ch}d matches the string abchd. With the traditional Spanish locale, the pattern [\g{ch}-d] matches ch and d, but does not match c or e. Recall that, as with any escape sequence, the backslash in \g must itself be escaped when the pattern is written in C++ source code: \\g{ch}. The first backslash is for the C++ compiler.

Performing all comparisons using collation

With tailored regular expressions, all comparisons are performed using Unicode collation. The type of collation can be specified using the setCollationStrength() method, and queried using the getCollationStrength() method. These methods may be used only with tailored regular expressions, and throw an unsupported error exception with basic regular expressions.

The collation support in RWURegularExpression is coarse-grained, meaning that it applies to the entire pattern. At this time, no fine-grained collation is supported.

If no collation strength is specified, then the default collation strength for the specified locale is used. For many locales, the default strength is Tertiary. For example, in the en locale, the pattern résumé uses tertiary collation strength by default. At this strength, the string résumé matches, but resume and Résumé do not. If the collation strength for the pattern is changed to Primary, then resume, résumé, and Résumé all match.

Tailored regular expressions, by default, do not recognize graphemes (other than those specified with \g) during pattern compilation, or when matching the “.” (or any other element).

As such, the pattern a\u0308+ would match an a followed by one or more occurrences of \u0308. Similarly, "." would match only the "a" in a\u0308. As an alternative, the InterpretGraphemes option can be used with tailored regular expressions. If this option is given as a constructor argument for a tailored regular expression, then the pattern a\u0308+ above would be interpreted as one or more occurrences of a\u0308, or ä, or any other equivalent.

Similarly, "." would match all of a\u0308.

NOTE: The InterpretGraphemes option is ignored for basic regular expressions.