Using the nextToken() Method


The nextToken() interface returns the next token, which may be empty. This allows the string being tokenized to contain empty fields of data. To detect the end of tokenization with this interface, call the done() method on the tokenizer. For example, this code extracts all tokens from a string using the nextToken() method:

RWUConversionContext ascii("ascii");
RWUString text("John,Doe;,,33,175;");
RWUString delimiters(",;");
RWUString next;
RWUTokenizer tok(text);

while (!tok.done()) {
    next = tok.nextToken(delimiters);
    // Process the token
}

The following tokens are extracted by this code:

John
Doe
(empty token)
(empty token)
33
175

Note that the comma and semicolon characters act as delimiters, and are specified using an RWUString.

In this case, two empty tokens are extracted by nextToken(). If the function call operator tokenizing interface had been used instead, the empty tokens would not be returned.
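The difference between the two interfaces can be sketched with standard C++ strings. This is only an illustration of the behavior described above, not SourcePro code: MiniTokenizer and collectTokens are hypothetical names, the sketch works on std::string rather than RWUString, and it assumes that done() becomes true once the whole input has been consumed.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tokenizer; NOT part of SourcePro.
class MiniTokenizer {
public:
    explicit MiniTokenizer(std::string text)
        : text_(std::move(text)), pos_(0) {}

    // Assumption: tokenization is finished once all input is consumed.
    bool done() const { return pos_ >= text_.size(); }

    // Returns the text up to the next delimiter character.
    // The result is empty when two delimiters are adjacent.
    std::string nextToken(const std::string& delimiters) {
        std::size_t end = text_.find_first_of(delimiters, pos_);
        if (end == std::string::npos) {
            end = text_.size();  // no delimiter left: take the rest
        }
        std::string token = text_.substr(pos_, end - pos_);
        // Step over the delimiter, if one was found.
        pos_ = (end < text_.size()) ? end + 1 : end;
        return token;
    }

private:
    std::string text_;
    std::size_t pos_;
};

// Collects every token. With keepEmpty == false, empty tokens are
// dropped, mimicking an interface that does not return them.
std::vector<std::string> collectTokens(const std::string& text,
                                       const std::string& delimiters,
                                       bool keepEmpty) {
    std::vector<std::string> tokens;
    MiniTokenizer tok(text);
    while (!tok.done()) {
        std::string next = tok.nextToken(delimiters);
        if (keepEmpty || !next.empty()) {
            tokens.push_back(next);
        }
    }
    return tokens;
}
```

Under these assumptions, calling collectTokens("John,Doe;,,33,175;", ",;", true) yields six tokens, two of them empty, while passing false for keepEmpty yields only the four non-empty tokens.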

The code below illustrates tokenizing a string using a regular expression delimiter and the nextToken() interface:

RWUConversionContext ascii("ascii");
RWUString text("John, Doe, 33,175;");
RWURegularExpression delimiters(RWCString("[{Zs}]*[,;][{Zs}]*"));