Finds delimiters in Unicode source strings, and provides sequential access to the tokens between those delimiters.
#include <rw/i18n/RWUTokenizer.h>
Public Member Functions
RWUTokenizer ()
RWUTokenizer (const RWUString &text)
RWUTokenizer (const RWUTokenizer &source)
~RWUTokenizer ()
bool done () const
RWUString getText () const
RWUConstSubString nextToken ()
RWUConstSubString nextToken (const RWUString &str)
RWUConstSubString nextToken (const RWUString &str, size_t num)
RWUConstSubString nextToken (RWURegularExpression &regex)
RWUConstSubString operator() ()
RWUConstSubString operator() (const RWUString &str)
RWUConstSubString operator() (const RWUString &str, size_t num)
RWUConstSubString operator() (RWURegularExpression &regex)
RWUTokenizer & operator= (const RWUTokenizer &rhs)
void setText (const RWUString &text)
RWUTokenizer finds delimiters in source strings, and provides sequential access to the tokens between those delimiters.
Delimiter characters are a user-defined set of characters used to separate the tokens, or fields, in a string. For example, consider the string:
Using the set of delimiter characters consisting of only a comma, you could break the string into three tokens:
RWUTokenizer provides methods for extracting in sequence each token from a string, while specifying a set of delimiters with each token request. Any single code point within the string is a candidate delimiter.
Delimiters can be specified in a variety of ways. If no delimiters are specified, then the next token is extracted using a predefined set of delimiter characters. This set consists of the following: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).
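As an illustration of the default set, the sketch below models the nine code points listed above as a predicate over char32_t values and uses it to split a string while skipping empty tokens. It is written in standard C++ only and does not use RWUTokenizer; the names isDefaultDelimiter, splitOnDefaults, and demoDefaultDelimiters are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true if cp is one of the nine default delimiters.
bool isDefaultDelimiter(char32_t cp) {
    switch (cp) {
        case 0x0009:  // horizontal tab
        case 0x000A:  // line feed
        case 0x000C:  // form feed
        case 0x000D:  // carriage return
        case 0x0020:  // space
        case 0x0085:  // next line
        case 0x2028:  // line separator
        case 0x2029:  // paragraph separator
        case 0x0000:  // null
            return true;
        default:
            return false;
    }
}

// Splits text on the default delimiters, dropping empty tokens.
std::vector<std::u32string> splitOnDefaults(const std::u32string& text) {
    std::vector<std::u32string> tokens;
    std::u32string current;
    for (char32_t cp : text) {
        if (isDefaultDelimiter(cp)) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(cp);
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Usage: a space run and a line separator both act as delimiters.
void demoDefaultDelimiters() {
    const std::vector<std::u32string> tokens =
        splitOnDefaults(U"one  two\u2028three");
    assert(tokens.size() == 3);
    assert(tokens[0] == U"one");
    assert(tokens[1] == U"two");
    assert(tokens[2] == U"three");
    // 0x000B (vertical tab) is not in the list above.
    assert(!isDefaultDelimiter(0x000B));
}
```

Note that the list skips 0x000B (vertical tab), so the sketch treats it as ordinary token content.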
Alternatively, you can specify an RWUString composed of a set of delimiter characters. Each code point in the input RWUString is taken as a possible delimiter character. A slight variation on this technique allows you to specify that only the first N code units in the delimiter string be considered as potential delimiters, in which case the string may have embedded nulls.
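The effect of limiting the delimiter set to its first N code units can be sketched as follows. This is a standard C++ model over std::u16string rather than RWUString, it assumes every delimiter fits in one UTF-16 code unit, and the names isDelimiter and demoFirstNCodeUnits are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// True if unit occurs among the first num code units of delims.
// Because the length is explicit, an embedded null can be a delimiter.
bool isDelimiter(char16_t unit, const std::u16string& delims, std::size_t num) {
    const std::size_t limit = std::min(num, delims.size());
    return delims.find(unit) < limit;  // npos compares greater than any limit
}

void demoFirstNCodeUnits() {
    // Three code units: comma, null, semicolon.
    const std::u16string delims(u",\0;", 3);
    assert(isDelimiter(u',', delims, 2));
    assert(isDelimiter(u'\0', delims, 2));   // embedded null is honored
    assert(!isDelimiter(u';', delims, 2));   // outside the first two units
    assert(isDelimiter(u';', delims, 3));
}
```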
Finally, you can specify the delimiters as an RWURegularExpression. This technique allows for the specification of complex, multi-character delimiters. While the above techniques search only for single-character (code point) delimiters, the regular expression interface can consume a single delimiter consisting of several code points.
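The difference between a delimiter set and a delimiter pattern can be seen in a small sketch. It uses std::string and std::regex as stand-ins for RWUString and RWURegularExpression, so it only models the idea and says nothing about the library's own matching rules; splitOnSet, splitOnPattern, and demoSetVersusPattern are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Every character of delims is an independent one-character delimiter.
// Empty fields between adjacent delimiters are kept.
std::vector<std::string> splitOnSet(const std::string& text,
                                    const std::string& delims) {
    std::vector<std::string> fields;
    std::string current;
    for (char c : text) {
        if (delims.find(c) != std::string::npos) {
            fields.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    fields.push_back(current);
    return fields;
}

// The whole pattern match is one delimiter, however many characters it spans.
std::vector<std::string> splitOnPattern(const std::string& text,
                                        const std::regex& pattern) {
    std::sregex_token_iterator first(text.begin(), text.end(), pattern, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}

void demoSetVersusPattern() {
    const std::string text = "1ab2a3";

    // Set {a, b}: "ab" is two delimiters with an empty field between them.
    const std::vector<std::string> bySet = splitOnSet(text, "ab");
    const std::vector<std::string> expectedSet = {"1", "", "2", "3"};
    assert(bySet == expectedSet);

    // Pattern "ab": one two-character delimiter; the lone "a" is token text.
    const std::vector<std::string> byPattern =
        splitOnPattern(text, std::regex("ab"));
    const std::vector<std::string> expectedPattern = {"1", "2a3"};
    assert(byPattern == expectedPattern);
}
```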
Two variations on the interface are provided. The first is provided using the function call operator()(). In the tradition of RWCTokenizer, this interface scans a string for all occurrences of tokens, consuming all consecutive occurrences of a delimiter. As such, the function call operator does not return empty tokens.
The second variation on the interface is provided through a set of overloads of the nextToken() method. This version of the interface returns the next token, which may be empty. This allows search strings to contain empty fields of data. To detect the end of tokenization using this interface, use the done() method on the tokenizer. When using the function call interface, either the done() method, or the traditional empty token condition can be used to detect the end of tokenization.
Program output:
RWUTokenizer::RWUTokenizer ()
Default constructor. Constructs an empty RWUTokenizer with no string to be tokenized. No tokens can be obtained from such a tokenizer until the setText() method is used to assign a string to the tokenizer.
RWUTokenizer::RWUTokenizer (const RWUString &text)
Constructs an RWUTokenizer with string text to be tokenized.
RWUTokenizer::RWUTokenizer (const RWUTokenizer &source)
Copy constructor. Initializes an RWUTokenizer as a deep copy of source. The new tokenizer begins tokenizing from the location in the search string where the source tokenizer left off. Tokenizations within either tokenizer do not affect the state of the other.
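The copy semantics described above (an independent copy that resumes where the source left off) can be modeled with a minimal value type. The sketch below is standard C++ and not the library class; SketchCursor, its next() method, and demoIndependentCopies are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Minimal model: owns its text and remembers the current position.
class SketchCursor {
public:
    explicit SketchCursor(std::string text) : text_(std::move(text)) {}

    // Returns the next field split on delim, or an empty string when exhausted.
    std::string next(char delim) {
        if (pos_ > text_.size()) {
            return std::string();
        }
        std::size_t end = text_.find(delim, pos_);
        if (end == std::string::npos) {
            end = text_.size();
        }
        std::string field = text_.substr(pos_, end - pos_);
        pos_ = end + 1;
        return field;
    }

private:
    std::string text_;     // deep copy of the text, duplicated on copy
    std::size_t pos_ = 0;  // copied too, so the copy resumes here
};

void demoIndependentCopies() {
    SketchCursor source("x,y,z");
    assert(source.next(',') == "x");

    SketchCursor copy(source);        // starts where source left off
    assert(copy.next(',') == "y");
    assert(copy.next(',') == "z");

    // Advancing the copy did not move the source.
    assert(source.next(',') == "y");
}
```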
RWUTokenizer::~RWUTokenizer ()
Destructor.
bool RWUTokenizer::done () const
Returns true if the last token from the search string has been extracted; otherwise, false. When using the function call operator interface, this equates to the last non-empty token having been returned.
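To make the end-of-tokenization condition concrete, the following sketch models both interfaces on a small stand-in class: a nextToken-style method that returns empty fields and a function call operator that skips them, each driving a done() flag. It is standard C++ over std::u32string, not the library implementation; SketchTokenizer and demoInterfaces are hypothetical, and the handling of trailing delimiters is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class SketchTokenizer {
public:
    explicit SketchTokenizer(std::u32string text) : text_(std::move(text)) {}

    bool done() const { return done_; }

    // nextToken-style: returns the next field, which may be empty.
    std::u32string nextToken(const std::u32string& delims) {
        if (done_) return std::u32string();
        const std::size_t end = text_.find_first_of(delims, pos_);
        if (end == std::u32string::npos) {
            done_ = true;  // last field extracted
            return text_.substr(pos_);
        }
        std::u32string token = text_.substr(pos_, end - pos_);
        pos_ = end + 1;
        return token;
    }

    // Function-call style: consumes runs of delimiters, so no empty
    // token is returned unless no tokens remain.
    std::u32string operator()(const std::u32string& delims) {
        const std::size_t npos = std::u32string::npos;
        if (done_) return std::u32string();
        const std::size_t start = text_.find_first_not_of(delims, pos_);
        if (start == npos) {
            done_ = true;
            return std::u32string();
        }
        const std::size_t end = text_.find_first_of(delims, start);
        std::u32string token =
            text_.substr(start, end == npos ? npos : end - start);
        // Skip trailing delimiters so done() turns true as soon as the
        // last non-empty token has been returned.
        pos_ = (end == npos) ? npos : text_.find_first_not_of(delims, end);
        if (pos_ == npos) {
            done_ = true;
            pos_ = text_.size();
        }
        return token;
    }

private:
    std::u32string text_;
    std::size_t pos_ = 0;
    bool done_ = false;
};

void demoInterfaces() {
    const std::u32string comma = U",";

    SketchTokenizer fields(U"a,,b");
    std::vector<std::u32string> withEmpty;
    while (!fields.done()) withEmpty.push_back(fields.nextToken(comma));
    const std::vector<std::u32string> expectedFields = {U"a", U"", U"b"};
    assert(withEmpty == expectedFields);

    SketchTokenizer words(U"a,,b,");
    std::vector<std::u32string> nonEmpty;
    while (!words.done()) nonEmpty.push_back(words(comma));
    const std::vector<std::u32string> expectedWords = {U"a", U"b"};
    assert(nonEmpty == expectedWords);
}
```

In this model, looping on done() works for both interfaces, while testing for an empty token is only a reliable stop condition with the function call operator.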
RWUString RWUTokenizer::getText () const
Returns a copy of the string associated with self.
RWUConstSubString RWUTokenizer::nextToken ()
Returns the next token, using the default set of delimiter characters: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).
This method may return an empty token if there are consecutive occurrences of any delimiter code point in the search string.
RWUConstSubString RWUTokenizer::nextToken (const RWUString &str)
Returns the next token, using the specified string str of delimiter code points.
This method may return an empty token if there are consecutive occurrences of any delimiter character in the search string.
RWUConstSubString RWUTokenizer::nextToken (const RWUString &str, size_t num)
Returns the next token, using the first num code units from the given string str as the set of delimiter code points.
This method may return an empty token if there are consecutive occurrences of any delimiter character in the search string.
RWUConstSubString RWUTokenizer::nextToken (RWURegularExpression &regex)
Returns the next token, using a delimiter pattern represented by the regular expression regex.
Unlike the other nextToken() overloads, this method allows a single occurrence of a delimiter to span multiple characters. For example, nextToken(RWUString("ab")) treats either a or b as a delimiter character, but nextToken(RWURegularExpression("ab")) treats the two-character pattern ab as a single delimiter.
This method may return an empty token if there are consecutive occurrences of the delimiter pattern in the search string.
RWUConstSubString RWUTokenizer::operator() ()
Returns the next token, using the default set of delimiter characters: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).
This method consumes consecutive occurrences of any delimiter code point, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.
RWUConstSubString RWUTokenizer::operator() (const RWUString &str)
Returns the next token, using the specified string str of delimiter characters.
This method consumes consecutive occurrences of any delimiter code point, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.
RWUConstSubString RWUTokenizer::operator() (const RWUString &str, size_t num)
Returns the next token, using the first num code units from the input string str as the set of delimiter characters.
This method consumes consecutive occurrences of any delimiter code point, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.
RWUConstSubString RWUTokenizer::operator() (RWURegularExpression &regex)
Returns the next token, using a delimiter pattern represented by the regular expression regex.
Unlike the other operator() overloads, this method allows a single occurrence of a delimiter to span multiple characters. For example, consider the RWUTokenizer instance tok. The statement tok(RWUString("ab")) treats either a or b as a delimiter character, but tok(RWURegularExpression("ab")) treats the two-character pattern ab as a single delimiter.
This method consumes consecutive occurrences of the delimiter pattern, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.
RWUTokenizer& RWUTokenizer::operator= (const RWUTokenizer &rhs)
Assignment operator. Makes self a deep copy of rhs. Self begins tokenizing from the location in the search string where the rhs tokenizer left off. Tokenizations within either tokenizer do not affect the state of the other. Returns a reference to self.
void RWUTokenizer::setText (const RWUString &text)
Sets the string to be tokenized by self to text. The starting position is set to the beginning of the string. A deep copy of the text string is stored within the tokenizer.
Copyright © 2020 Rogue Wave Software, Inc. All Rights Reserved.