Specifying Delimiters

Delimiter characters are a user-defined set of characters used to separate the tokens, or fields, in a string. Consider the following string:

Token1,Token2,Token3

Using the set of delimiter characters consisting of only a comma, you could break the string into the following three tokens:

Token1

Token2

Token3

RWUTokenizer provides methods for extracting each token from a string in sequence. A set of delimiters can be specified with each token request, so the delimiters may change from one token to the next.
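The idea of supplying delimiters with each token request can be sketched with the standard library alone. The class below is a hypothetical stand-in written for illustration; its name, member functions, and treatment of empty tokens are assumptions and do not reproduce the RWUTokenizer interface.

```cpp
#include <string>

// Hypothetical stand-in for illustration; not the RWUTokenizer interface.
class SimpleTokenizer {
public:
    explicit SimpleTokenizer(const std::string& text) : text_(text), pos_(0) {}

    // True once all input has been consumed.
    bool done() const { return pos_ > text_.size(); }

    // Returns the text up to the next character found in `delims`,
    // then advances past that delimiter. Each call may pass a different set.
    std::string next(const std::string& delims)
    {
        if (done()) return std::string();
        std::string::size_type end = text_.find_first_of(delims, pos_);
        if (end == std::string::npos) end = text_.size();
        std::string token = text_.substr(pos_, end - pos_);
        pos_ = end + 1;
        return token;
    }

private:
    std::string text_;
    std::string::size_type pos_;
};
```

For the input key=value;rest, requesting tokens with "=", then ";", then "," would yield key, value, and rest in turn.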

Delimiters can be specified in a variety of ways. If no delimiters are specified, then the next token is extracted using a predefined set of delimiter characters. This set consists of the following: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).
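As a rough illustration, the predefined set listed above can be written out as UTF-16 code units and used to split a string. All nine values lie in the Basic Multilingual Plane, so each occupies a single code unit. The names kDefaultDelims, isDefaultDelimiter, and splitOnDefaults are hypothetical, and skipping empty tokens is an assumption of this sketch rather than documented library behavior.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// The nine default delimiter code points listed in the text.
const char16_t kDefaultDelims[] = {
    0x0009, // horizontal tab
    0x000A, // line feed
    0x000C, // form feed
    0x000D, // carriage return
    0x0020, // space
    0x0085, // next line
    0x2028, // line separator
    0x2029, // paragraph separator
    0x0000  // null
};

// True if `c` is one of the default delimiters above.
bool isDefaultDelimiter(char16_t c)
{
    return std::find(std::begin(kDefaultDelims), std::end(kDefaultDelims), c)
           != std::end(kDefaultDelims);
}

// Splits `text` on the default set. Empty tokens are skipped here
// (an assumption of this sketch, not documented library behavior).
std::vector<std::u16string> splitOnDefaults(const std::u16string& text)
{
    std::vector<std::u16string> tokens;
    std::u16string current;
    for (char16_t c : text) {
        if (isDefaultDelimiter(c)) {
            if (!current.empty()) tokens.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```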

Alternatively, you can specify an RWUString composed of a set of delimiter characters. Each code point in the delimiter argument is taken as a possible delimiter character. A slight variation on this technique lets you specify that only the first N code units in the delimiter argument are considered as potential delimiters.
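The effect of limiting the search to the first N code units can be sketched with std::u16string. The helper findDelimiter is hypothetical and is not a SourcePro function. It compares individual code units, so it matches the description above only when every delimiter lies in the Basic Multilingual Plane.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: position of the first code unit in `text`, at or
// after `from`, that equals one of the first `n` code units of `delims`.
// Returns std::u16string::npos when there is no such code unit.
std::u16string::size_type findDelimiter(const std::u16string& text,
                                        const std::u16string& delims,
                                        std::u16string::size_type n,
                                        std::u16string::size_type from = 0)
{
    n = std::min(n, delims.size()); // never read past the delimiter string
    return text.find_first_of(delims.data(), from, n);
}
```

With the delimiter string ",;" and N set to 1, only the comma acts as a delimiter; with N set to 2, the semicolon does as well.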

Finally, you can specify the delimiter argument as an RWURegularExpression. This technique allows for the specification of complex, multicharacter delimiters. While the above techniques search only for single-character (code point) delimiters, the regular expression interface can match a single delimiter that spans several code points.
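A multicharacter delimiter of this kind can be illustrated with the standard library's std::regex in place of RWURegularExpression. The function splitOnPattern is a hypothetical helper, and the pattern syntax is ECMAScript as used by std::regex, which may differ from the syntax RWURegularExpression accepts.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` wherever `pattern` matches.
// A single match may span several characters.
std::vector<std::string> splitOnPattern(const std::string& text,
                                        const std::string& pattern)
{
    const std::regex delimiter(pattern);
    // Submatch index -1 selects the text between matches, not the matches.
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1);
    std::sregex_token_iterator end;
    return std::vector<std::string>(it, end);
}
```

For example, the pattern "::" would split Token1::Token2::Token3 into three tokens, treating each pair of colons as one delimiter.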

Note that the static method RWUCharTraits::getWhitespace() returns a null-terminated array of whitespace code points, as a convenience for use as delimiters.