SourcePro® API Reference Guide

Finds delimiters in Unicode source strings, and provides sequential access to the tokens between those delimiters.

#include <rw/i18n/RWUTokenizer.h>

Public Member Functions

 RWUTokenizer ()
 
 RWUTokenizer (const RWUString &text)
 
 RWUTokenizer (const RWUTokenizer &source)
 
 ~RWUTokenizer ()
 
bool done () const
 
RWUString getText () const
 
RWUConstSubString nextToken ()
 
RWUConstSubString nextToken (const RWUString &str)
 
RWUConstSubString nextToken (const RWUString &str, size_t num)
 
RWUConstSubString nextToken (RWURegularExpression &regex)
 
RWUConstSubString operator() ()
 
RWUConstSubString operator() (const RWUString &str)
 
RWUConstSubString operator() (const RWUString &str, size_t num)
 
RWUConstSubString operator() (RWURegularExpression &regex)
 
RWUTokenizer & operator= (const RWUTokenizer &rhs)
 
void setText (const RWUString &text)
 

Detailed Description

RWUTokenizer finds delimiters in source strings, and provides sequential access to the tokens between those delimiters.

Delimiter characters are a user-defined set of characters used to separate the tokens, or fields, in a string. For example, consider the string:

Token1,Token2,Token3

Using the set of delimiter characters consisting of only a comma, you could break the string into three tokens:

Token1
Token2
Token3
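
The splitting shown above can be sketched without SourcePro at all. The following is a minimal standalone model that uses std::string instead of RWUString; the function name splitFields is hypothetical and is not part of the RWUTokenizer API.

```cpp
#include <string>
#include <vector>

// Standalone model, not SourcePro code: split `text` at every character that
// appears in `delims`, keeping empty fields.
std::vector<std::string> splitFields(const std::string& text,
                                     const std::string& delims)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type end = text.find_first_of(delims, start);
        if (end == std::string::npos) {
            fields.push_back(text.substr(start));      // last field
            return fields;
        }
        fields.push_back(text.substr(start, end - start));
        start = end + 1;                               // step past the delimiter
    }
}
```

Calling splitFields("Token1,Token2,Token3", ",") yields the three fields Token1, Token2, and Token3.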

RWUTokenizer provides methods for extracting each token from a string in sequence, with a set of delimiters specified on each token request. Any single code point within the string is a candidate delimiter.

Delimiters can be specified in a variety of ways. If no delimiters are specified, then the next token is extracted using a predefined set of delimiter characters. This set consists of the following: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).
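
For reference, the predefined set can be written as a small predicate. This is only an illustrative helper built from the list above; the name isDefaultDelimiter is hypothetical, and nothing here claims to show how RWUTokenizer stores the set internally.

```cpp
// Standalone helper, not part of SourcePro: reports whether a code point is in
// the predefined delimiter set listed in this section.
bool isDefaultDelimiter(char32_t cp)
{
    switch (cp) {
    case 0x0000: // null
    case 0x0009: // horizontal tab
    case 0x000A: // line feed
    case 0x000C: // form feed
    case 0x000D: // carriage return
    case 0x0020: // space
    case 0x0085: // next line
    case 0x2028: // line separator
    case 0x2029: // paragraph separator
        return true;
    default:
        return false;
    }
}
```

Note that 0x000B (vertical tab) does not appear in the list, so isDefaultDelimiter(0x000B) returns false.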

Alternatively, you can specify an RWUString, composed of a set of delimiter characters. Each code point in the input RWUString is taken as a possible delimiter character. A slight variation on this technique allows you to specify that only the first N code units in the delimiter string be considered as potential delimiters, in which case the string may have embedded nulls.
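
The reason for the counted variation is that a length is the only way to include a null among the delimiters. The sketch below illustrates the idea with std::u16string, whose counted find_first_of overload behaves similarly; findDelimiter is a hypothetical name, and the code is a model rather than the RWUTokenizer implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Standalone model, not SourcePro code: find the first delimiter in `text`,
// treating only the first `num` code units of `delims` as delimiters.
std::u16string::size_type findDelimiter(const std::u16string& text,
                                        const std::u16string& delims,
                                        std::size_t num)
{
    assert(num <= delims.size());   // cannot use more code units than exist
    // The counted overload does not stop at a null code unit in `delims`.
    return text.find_first_of(delims.data(), 0, num);
}
```

If delims holds the three code units comma, null, and semicolon, then num = 2 makes comma and null the delimiters and leaves the semicolon out.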

Finally, you can specify the delimiters as an RWURegularExpression. This technique allows for the specification of complex, multi-character delimiters. While the above techniques search only for single-character (code point) delimiters, the regular expression interface can consume a single delimiter consisting of several code points.
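
The difference between a set of single-character delimiters and one multi-character delimiter can be seen in a small model. The sketch below uses std::regex as a stand-in for RWURegularExpression, which is an assumption made purely for illustration; splitOnPattern is a hypothetical helper.

```cpp
#include <regex>
#include <string>
#include <vector>

// Standalone model using std::regex, not RWURegularExpression: return the
// pieces of `text` that lie between matches of `pattern`.
std::vector<std::string> splitOnPattern(const std::string& text,
                                        const std::string& pattern)
{
    const std::regex delim(pattern);
    // Submatch index -1 selects the text between matches.
    std::sregex_token_iterator first(text.begin(), text.end(), delim, -1);
    const std::sregex_token_iterator last;
    std::vector<std::string> pieces;
    for (; first != last; ++first) {
        pieces.push_back(first->str());
    }
    return pieces;
}
```

With the text 1ab2a3b4, the pattern ab is one two-character delimiter and gives the pieces 1 and 2a3b4. The pattern [ab] treats a and b as separate delimiters and gives 1, an empty piece, 2, 3, and 4.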

Two variations on the interface are provided. The first is provided using the function call operator()(). In the tradition of RWCTokenizer, this interface scans a string for all occurrences of tokens, consuming all consecutive occurrences of a delimiter. As such, the function call operator does not return empty tokens.

The second variation on the interface is provided through a set of overloads of the nextToken() method. This version of the interface returns the next token, which may be empty. This allows search strings to contain empty fields of data. To detect the end of tokenization using this interface, use the done() method on the tokenizer. When using the function call interface, either the done() method or the traditional empty-token condition can be used to detect the end of tokenization.
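
The contrast between the two interfaces can be modeled in a few lines of standard C++. The class below is a sketch, not RWUTokenizer: MiniTokenizer is a hypothetical name, std::regex stands in for RWURegularExpression, and the handling of the end of the text is an assumption, since this page does not spell it out.

```cpp
#include <regex>
#include <string>
#include <utility>

// Standalone model of the two documented behaviors; not RWUTokenizer itself.
class MiniTokenizer {
public:
    explicit MiniTokenizer(std::string text) : text_(std::move(text)) {}

    // True once the last field has been handed out.
    bool done() const { return done_; }

    // Models nextToken(): returns the next field, which may be empty.
    std::string nextToken(const std::regex& delim)
    {
        if (done_) {
            return std::string();
        }
        const std::string::const_iterator begin = text_.cbegin() + pos_;
        std::smatch m;
        if (std::regex_search(begin, text_.cend(), m, delim) && m.length(0) > 0) {
            std::string token(begin, m[0].first);
            pos_ = static_cast<std::string::size_type>(m[0].second - text_.cbegin());
            return token;
        }
        done_ = true;                       // no delimiter left: last field
        return std::string(begin, text_.cend());
    }

    // Models operator(): skips empty fields; empty result means "finished".
    std::string operator()(const std::regex& delim)
    {
        while (!done_) {
            std::string token = nextToken(delim);
            if (!token.empty()) {
                return token;
            }
        }
        return std::string();
    }

private:
    std::string text_;
    std::string::size_type pos_ = 0;
    bool done_ = false;
};
```

For the text John, Doe; 33,175; ; Anchorage, AK and the pattern [,;] +, looping on nextToken() until done() is true yields six fields, the fourth of which is empty, while looping on operator() until it returns an empty string yields five.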

Example
#include <rw/i18n/RWUTokenizer.h>
#include <rw/i18n/RWUConversionContext.h>
#include <rw/i18n/RWURegularExpression.h>
#include <iostream>

using std::cout;
using std::endl;

int
main()
{
    // Create a conversion context to convert between
    // US-ASCII and Unicode
    RWUConversionContext ascii("US-ASCII");

    // Create a search string
    RWUString text("John, Doe; 33,175; ; Anchorage, AK");

    // Delimit fields with a ',' or a ';' followed by one or more
    // whitespace characters.
    RWURegularExpression delim("[,;][{Zs}]+");

    // Create a tokenizer and a string in which to receive tokens
    RWUTokenizer tknzr(text);
    RWUString token;

    // Extract tokens using the function call operator
    // interface. Note that empty tokens are *not* returned.
    cout << "Using function call operator:" << endl;
    for (token = tknzr(delim); !token.isNull(); token = tknzr(delim)) {
        cout << " <" << token << ">" << endl;
    } // for

    // Reset the tokenizer.
    tknzr.setText(text);

    // Extract tokens again, using the nextToken() interface.
    // Note that consecutive delimiters will cause nextToken()
    // to return an empty token.
    cout << "\nUsing nextToken():" << endl;
    while (!tknzr.done()) {
        token = tknzr.nextToken(delim);
        cout << " <" << token << ">" << endl;
    } // while

    return 0;
} // main

Program output:

Using function call operator:
 <John>
 <Doe>
 <33,175>
 <Anchorage>
 <AK>

Using nextToken():
 <John>
 <Doe>
 <33,175>
 <>
 <Anchorage>
 <AK>

Constructor & Destructor Documentation

RWUTokenizer::RWUTokenizer ( )

Default constructor. Constructs an empty RWUTokenizer with no string to be tokenized. No tokens can be obtained from such a tokenizer until the setText() method is used to assign a string to the tokenizer.

RWUTokenizer::RWUTokenizer ( const RWUString & text )

Constructs an RWUTokenizer with string text to be tokenized.

RWUTokenizer::RWUTokenizer ( const RWUTokenizer & source )

Copy constructor. Initializes an RWUTokenizer as a deep copy of source. The new tokenizer begins tokenizing from the location in the search string where the source tokenizer left off. Tokenizations within either tokenizer do not affect the state of the other.
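
The copy behavior described above, where the copy resumes at the source's current position and the two objects then advance independently, can be pictured with a small value-type model. CursorTokenizer is a hypothetical stand-in that splits on commas only; it is a sketch of the semantics, not the RWUTokenizer implementation.

```cpp
#include <string>
#include <utility>

// Standalone model, not RWUTokenizer: the text and the current position are
// plain value members, so the compiler-generated copy is a deep copy.
class CursorTokenizer {
public:
    explicit CursorTokenizer(std::string text) : text_(std::move(text)) {}

    // Returns the next comma-separated field and advances the position.
    // Returns an empty string once the text is exhausted.
    std::string next()
    {
        if (pos_ > text_.size()) {
            return std::string();
        }
        const std::string::size_type end = text_.find(',', pos_);
        const std::string::size_type stop =
            (end == std::string::npos) ? text_.size() : end;
        std::string token = text_.substr(pos_, stop - pos_);
        pos_ = stop + 1;                    // step past the delimiter (or the end)
        return token;
    }

private:
    std::string text_;
    std::string::size_type pos_ = 0;
};
```

If a tokenizer over x,y,z has already returned x and is then copied, both the original and the copy return y next, and advancing one does not move the other.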

RWUTokenizer::~RWUTokenizer ( )

Destructor.

Member Function Documentation

bool RWUTokenizer::done ( ) const

Returns true if the last token from the search string has been extracted; otherwise, false. When using the function call operator interface, this equates to the last non-empty token having been returned.

RWUString RWUTokenizer::getText ( ) const

Returns a copy of the string associated with self.

RWUConstSubString RWUTokenizer::nextToken ( )

Returns the next token, using the default set of delimiter characters: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).

This method may return an empty token if there are consecutive occurrences of any delimiter code point in the search string.

RWUConstSubString RWUTokenizer::nextToken ( const RWUString & str )

Returns the next token, using the specified string str of delimiter code points.

This method may return an empty token if there are consecutive occurrences of any delimiter character in the search string.

RWUConstSubString RWUTokenizer::nextToken ( const RWUString & str, size_t num )

Returns the next token, using the first num code units from the given string str as the set of delimiter code points.

This method may return an empty token if there are consecutive occurrences of any delimiter character in the search string.

RWUConstSubString RWUTokenizer::nextToken ( RWURegularExpression & regex )

Returns the next token, using a delimiter pattern represented by the regular expression pattern regex.

Unlike the other nextToken() overloads, this method allows a single occurrence of a delimiter to span multiple characters. For example, nextToken(RWUString("ab")) treats either a or b as a delimiter character, but nextToken(RWURegularExpression("ab")) treats the two-character pattern ab as a single delimiter.

This method may return an empty token if there are consecutive occurrences of the delimiter pattern in the search string.

RWUConstSubString RWUTokenizer::operator() ( )

Returns the next token, using the default set of delimiter characters: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).

This method consumes consecutive occurrences of any delimiter code point, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.

RWUConstSubString RWUTokenizer::operator() ( const RWUString & str )

Returns the next token, using the specified string str of delimiter characters.

This method consumes consecutive occurrences of any delimiter code point, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.

RWUConstSubString RWUTokenizer::operator() ( const RWUString & str, size_t num )

Returns the next token, using the first num code units from the input string str as the set of delimiter characters.

This method consumes consecutive occurrences of any delimiter code point, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.

RWUConstSubString RWUTokenizer::operator() ( RWURegularExpression & regex )

Returns the next token, using a delimiter pattern represented by the regular expression pattern regex.

Unlike the other operator() overloads, this method allows a single occurrence of a delimiter to span multiple characters. For example, consider the RWUTokenizer instance tok. The statement tok(RWUString("ab")) treats either a or b as a delimiter character, but tok(RWURegularExpression("ab")) treats the two-character pattern ab as a single delimiter.

This method consumes consecutive occurrences of the delimiter pattern, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.

RWUTokenizer & RWUTokenizer::operator= ( const RWUTokenizer & rhs )

Assignment operator. Makes self a deep copy of rhs. After the assignment, self begins tokenizing from the location in the search string where rhs left off. Tokenizations within either tokenizer do not affect the state of the other. Returns a reference to self.

void RWUTokenizer::setText ( const RWUString & text )

Sets the string to be tokenized by self to text. The starting position is set to the beginning of the string. A deep copy of the text string is stored within the tokenizer.

Copyright © 2021 Rogue Wave Software, Inc., a Perforce company. All Rights Reserved.