SourcePro® API Reference Guide

 

Provides unidirectional text conversion from strings in various encodings to UTF-16-encoded RWUString instances.

#include <rw/i18n/RWUToUnicodeConverter.h>

Inheritance diagram for RWUToUnicodeConverter:
RWUConverterBase

Classes

class  ErrorResponseState
 Stores the current error response state of the converter so the state can be restored if necessary.
 

Public Types

typedef RWTFunctor< bool(char &)> ConverterByteSource
 
typedef RWTFunctor< bool(RWUChar32)> DelimiterTest
 
enum  ErrorResponseType { Stop , Skip , Substitute , Escape }
 

Public Member Functions

 RWUToUnicodeConverter (const char *encoding)
 
 RWUToUnicodeConverter (const RWUConverterBase &original)
 
 RWUToUnicodeConverter (const RWUToUnicodeConverter &original)
 
 ~RWUToUnicodeConverter ()
 
void convert (char source, RWUString &target, bool flush=true)
 
void convert (const char *source, RWUString &target, bool flush=true)
 
void convert (const char source[], int32_t size, RWUString &target, bool flush=true)
 
void convert (const RWCString &source, RWUString &target, bool flush=true)
 
void convert (const std::string &source, RWUString &target, bool flush=true)
 
bool convert (ConverterByteSource source, RWUChar32 &target, size_t &currentByteCount, bool flushAtEnd=true)
 
bool convert (ConverterByteSource source, RWUString &target, size_t &currentByteCount, size_t maxNumCodePoints=0, DelimiterTest delimiterTest=RWUToUnicodeConverter::getWhitespaceDelimiterTest(), bool ignoreLeadingDelimiters=true, bool flushAtEnd=true)
 
RWUToUnicodeConverter & operator= (const RWUConverterBase &rhs)
 
RWUToUnicodeConverter & operator= (const RWUToUnicodeConverter &rhs)
 
void reset ()
 
void restoreErrorResponseState (const ErrorResponseState &state)
 
ErrorResponseState saveErrorResponseState () const
 
void setErrorResponse (ErrorResponseType response)
 
- Public Member Functions inherited from RWUConverterBase
 ~RWUConverterBase ()
 
RWCString getCanonicalName () const
 
void getLocalizedName (const RWULocale &locale, RWUString &result) const
 
size_t getMaxBytesPerChar () const
 
size_t getMinBytesPerChar () const
 
bool operator!= (const RWUConverterBase &rhs) const
 
bool operator== (const RWUConverterBase &rhs) const
 

Static Public Member Functions

static DelimiterTest getCodePointDelimiterTest (RWUChar32 delimiter)
 
static DelimiterTest getWhitespaceDelimiterTest ()
 
- Static Public Member Functions inherited from RWUConverterBase
static RWCString getCurrentLocaleEncodingName ()
 
static RWCString getDefaultEncodingName ()
 
static void setDefaultEncodingName (const char *encoding)
 

Additional Inherited Members

- Protected Member Functions inherited from RWUConverterBase
 RWUConverterBase (const char *encoding)
 
 RWUConverterBase (const RWUConverterBase &original)
 
RWUConverterBase & operator= (const RWUConverterBase &rhs)
 

Detailed Description

RWUToUnicodeConverter provides a unidirectional text conversion facility for translating from strings in various encodings to UTF-16 encoded RWUString instances.

A converter does not synchronize modifications to its internal state, so converters cannot be shared between threads.

The convert() method appends the results of a conversion to a target buffer. If its flush argument is true, convert() flushes its internal buffers to the target buffer and clears its internal state. For modal encodings such as ISO-2022, clearing the internal state ensures that the next call to convert() can expect the source text to begin in the source encoding's default, unshifted state.

Calling convert() once with a value of true for flush is useful when converting a piece of text in its entirety from a source encoding to UTF-16. In contrast, convert() may be used to fill a target buffer in a piecemeal fashion. Repeatedly calling convert() with a value of false for flush, then calling it once with a value of true, causes convert() to flush its buffers and clear its internal state only at the end of a multi-invocation conversion process.

At the conclusion of a successful call to convert() with flush set to true, the converter is reset automatically to a default, initial state, ready to start a new conversion process. Sometimes, however, it may be necessary to reset a converter explicitly using the reset() method:

  • if convert() has thrown an exception in response to an error, and you want to be sure the converter is in the default state before using it again
  • if you are using the converter to fill a target buffer in a piecemeal fashion, and you wish to abandon that conversion process to begin another
  • if you are copying a converter, and want to be sure the copy is in the default state
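
The flush behavior described above can be modeled without the library. The following is a minimal standalone sketch, not SourcePro code: ChunkedUtf8Decoder is a hypothetical class that decodes UTF-8 into a std::u16string, keeps the bytes of an incomplete trailing sequence between calls, and reports a truncated sequence only when flush is true. It assumes well-formed lead bytes and does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a stateful converter; not part of SourcePro.
class ChunkedUtf8Decoder {
public:
    // Appends decoded UTF-16 code units to target. Bytes of an incomplete
    // trailing sequence are retained for the next call unless flush is true.
    void convert(const std::string& source, std::u16string& target,
                 bool flush = true) {
        const std::string input = pending_ + source;
        std::u16string out;  // staged so target is untouched on failure
        std::size_t i = 0;
        while (i < input.size()) {
            const unsigned char lead = static_cast<unsigned char>(input[i]);
            const std::size_t len = sequenceLength(lead);
            if (len == 0) throw std::runtime_error("ill-formed lead byte");
            if (i + len > input.size()) break;  // rest arrives in a later chunk
            char32_t cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
            for (std::size_t k = 1; k < len; ++k) {
                const unsigned char c = static_cast<unsigned char>(input[i + k]);
                if ((c & 0xC0) != 0x80)
                    throw std::runtime_error("ill-formed trail byte");
                cp = (cp << 6) | (c & 0x3F);
            }
            appendUtf16(cp, out);
            i += len;
        }
        if (flush && i < input.size())
            throw std::runtime_error("truncated character at end of source");
        pending_ = input.substr(i);  // empty after a successful flush
        target += out;
    }

    // Discards any partially accumulated sequence.
    void reset() { pending_.clear(); }

private:
    static std::size_t sequenceLength(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 0;  // continuation byte or invalid lead byte
    }
    static void appendUtf16(char32_t cp, std::u16string& out) {
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    std::string pending_;
};
```

In this model, passing false for flush on every chunk but the last lets a character that straddles two chunks decode correctly, and calling reset() after an exception discards any retained bytes before the next conversion.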

Error Handling

Use setErrorResponse() to control how a converter handles ill-formed character encoding sequences in the source data. Take special care when processing multibyte encodings:

  • If the error response type is Substitute, a converter produces a substitution character whenever an ill-formed multibyte character sequence is encountered. If a truncated sequence is encountered, the byte immediately following the sequence is also consumed and ignored. If that byte starts another multibyte sequence, then the converter is no longer synchronized with the source data sequence. At this point, the converter may incorrectly interpret subsequent bytes, producing a sequence of invalid code points and substitution characters. If a multibyte character encoding sequence is truncated by the end of the source sequence, the converter does not produce a substitution character, but instead throws RWUException.
  • If the error response type is Skip, a converter skips any ill-formed multibyte character sequences that are encountered. If a truncated sequence is encountered, the byte immediately following the truncated sequence is returned as the result. If that byte starts another multibyte sequence, then this result is invalid and the converter is no longer synchronized with the source data sequence. At this point, the converter may incorrectly interpret subsequent bytes, absorbing bytes or producing invalid code points. If a multibyte character encoding sequence is truncated by the end of the source sequence, the converter does not simply skip the sequence, but instead throws RWUException.
  • Use an error response type of Stop to interrupt a conversion when a converter loses synchronization with a multibyte encoded source sequence.
Example

#include <rw/i18n/RWUFromUnicodeConverter.h>
#include <rw/i18n/RWUString.h>
#include <rw/i18n/RWUToUnicodeConverter.h>
#include <iostream>

using std::cout;
using std::endl;

int main() {
    // Convert from ISO-8859-1 to UTF-16. The byte 0xE9 encodes
    // `é' in ISO-8859-1.
    RWUToUnicodeConverter fromIso_8859_1("ISO-8859-1");
    RWCString cstr("She sat in the caf\xE9, sipping coffee.");
    RWUString ustr;
    fromIso_8859_1.convert(cstr, ustr);

    // Convert from UTF-16 to US-ASCII. Note that `?' is
    // substituted for `é', which cannot be represented
    // in US-ASCII.
    RWUFromUnicodeConverter toUsAscii("US-ASCII");
    toUsAscii.setSubstitutionSequence("?", 1);
    cout << ustr.toBytes(toUsAscii) << endl;

    // Save the error response state
    RWUFromUnicodeConverter::ErrorResponseState state =
        toUsAscii.saveErrorResponseState();

    // Convert from UTF-16 to US-ASCII again, replacing
    // `é' with an escape sequence suitable for use in
    // an XML or HTML file.
    toUsAscii.setErrorResponse(
        RWUFromUnicodeConverter::EscapeXmlHexadecimal);
    cout << ustr.toBytes(toUsAscii) << endl;

    // Restore the original error response state
    toUsAscii.restoreErrorResponseState(state);
    return 0;
}

Program output:

She sat in the caf?, sipping coffee.
She sat in the caf&#xE9;, sipping coffee.
See also
RWUConverterBase, RWUToUnicodeConversionContext

Member Typedef Documentation

◆ ConverterByteSource

typedef RWTFunctor<bool(char&)> RWUToUnicodeConverter::ConverterByteSource

A callback used by one of the convert() overloads. It accepts a char reference, which it fills if possible with the next char. Return true if the next char is provided; otherwise, false.

◆ DelimiterTest

typedef RWTFunctor<bool(RWUChar32)> RWUToUnicodeConverter::DelimiterTest

A callback used by one of the convert() overloads. The callback should return true if an RWUChar32 passed to it is a delimiter; otherwise, false.
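
The two callback shapes can be illustrated with standard C++ alone. In this sketch, std::function stands in for RWTFunctor (an assumption made so the code compiles without the library), and makeStringSource and makeCodePointDelimiterTest are hypothetical helpers, the latter modeled on getCodePointDelimiterTest().

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <string>

// std::function stands in for RWTFunctor here (an assumption).
using ByteSource = std::function<bool(char&)>;
using DelimiterTest = std::function<bool(char32_t)>;

// Hypothetical helper: a byte source that walks a private copy of a string.
// The read position lives in shared state so copies of the functor advance
// the same cursor.
ByteSource makeStringSource(const std::string& bytes) {
    auto data = std::make_shared<const std::string>(bytes);
    auto pos = std::make_shared<std::size_t>(0);
    return [data, pos](char& next) {
        if (*pos >= data->size()) return false;  // no byte available
        next = (*data)[(*pos)++];
        return true;  // next now holds the byte
    };
}

// Hypothetical helper: returns a test that matches exactly one code point.
DelimiterTest makeCodePointDelimiterTest(char32_t delimiter) {
    return [delimiter](char32_t cp) { return cp == delimiter; };
}
```

Keeping the cursor in shared state is a design choice for this sketch: functors are typically passed by value, so a cursor stored inside the callable itself would be duplicated on every copy.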

Member Enumeration Documentation

◆ ErrorResponseType

An ErrorResponseType value indicates what action an RWUToUnicodeConverter should take when it encounters an error during the conversion process. For example, the code units in the source string may be impossible to interpret as a code point value in the source encoding. The default error response is RWUToUnicodeConverter::Substitute.

See also
setErrorResponse()
Enumerator
Stop 

Stops the conversion process, and throws an RWUException.

Skip 

Silently skips over any illegal sequences, without writing to the target buffer.

Substitute 

Substitutes illegal sequences with the Unicode substitution character, U+FFFD.

Escape 

Replaces any illegal sequences with a %Xhh escaped hexadecimal representation of the bytes that make up the illegal sequence; for example, %X09%XA0.
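
To make the difference between the responses concrete, here is a minimal standalone sketch that decodes US-ASCII and treats every byte of 0x80 or above as an illegal sequence. It is a model only, not the library's implementation: decodeAscii and the ErrorResponse enum are hypothetical names, std::runtime_error stands in for RWUException, and Escape is left out.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical mirror of three of the enumerators described above.
enum class ErrorResponse { Stop, Skip, Substitute };

// Hypothetical sketch: decode US-ASCII, treating bytes >= 0x80 as illegal.
std::u16string decodeAscii(const std::string& source, ErrorResponse response) {
    std::u16string target;
    for (char ch : source) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            target.push_back(static_cast<char16_t>(byte));
            continue;
        }
        switch (response) {
        case ErrorResponse::Stop:
            throw std::runtime_error("illegal byte in US-ASCII source");
        case ErrorResponse::Skip:
            break;  // drop the byte and write nothing
        case ErrorResponse::Substitute:
            target.push_back(u'\uFFFD');  // Unicode substitution character
            break;
        }
    }
    return target;
}
```

For the input bytes "caf\xE9!", this model yields "caf!" under Skip, "caf" followed by U+FFFD and "!" under Substitute, and an exception under Stop.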

Constructor & Destructor Documentation

◆ RWUToUnicodeConverter() [1/3]

RWUToUnicodeConverter::RWUToUnicodeConverter ( const char * encoding)
inline

Constructs an RWUToUnicodeConverter for the character encoding scheme given by encoding, the US-ASCII name or alias of a character encoding scheme. See RWUAvailableEncodingList and RWUEncodingAliasList for lists of the character encoding schemes recognized by the Internationalization Module.

Exceptions
RWUException: Thrown to indicate that the converter could not be constructed. The exception carries one of the following status codes:

◆ RWUToUnicodeConverter() [2/3]

RWUToUnicodeConverter::RWUToUnicodeConverter ( const RWUConverterBase & original)
inline

Constructs a converter that is a deep copy of another converter. The new converter uses the same character encoding scheme as the original converter, and possesses the same internal state as the original converter.

Exercise care when copying converters, especially those used for stateful or multibyte encodings. The new converter may be initialized in a state that causes the converter to produce errors if used to convert a new chunk of text. Consider using reset() to restore the converter to a known state before use.

Exceptions
RWUException: Thrown to indicate that the copy could not be completed because memory could not be allocated for the underlying implementation object.

◆ RWUToUnicodeConverter() [3/3]

RWUToUnicodeConverter::RWUToUnicodeConverter ( const RWUToUnicodeConverter & original)
inline

Constructs a converter that is a deep copy of another converter. The new converter uses the same character encoding scheme as the original converter, and possesses the same internal state as the original converter.

Exercise care when copying converters, especially those used for stateful or multibyte encodings. The new converter may be initialized in a state that causes the converter to produce errors if used to convert a new chunk of text. Consider using reset() to restore the converter to a known state before use.

Exceptions
RWUException: Thrown to indicate that the copy could not be completed because memory could not be allocated for the underlying implementation object.

◆ ~RWUToUnicodeConverter()

RWUToUnicodeConverter::~RWUToUnicodeConverter ( )
inline

Destructor.

Member Function Documentation

◆ convert() [1/7]

void RWUToUnicodeConverter::convert ( char source,
RWUString & target,
bool flush = true )

Converts a single byte source into an equivalent sequence of UTF-16 code units and appends those code units to the current contents of the target RWUString. The source contents are interpreted according to the character encoding scheme associated with self.

The boolean value flush specifies whether self should be flushed to ensure that any code units stored in the converter's internal state are written to target. The default (true) value explicitly forces a flush and resets the converter to the known default state. This value must be set to true when the current source buffer is the last available chunk of source.

You must also be sure that the source string encodes complete characters, if the output may be flushed, as any saved state and characters would be lost.

Exceptions
RWUException: Thrown if an unhandled conversion error occurs. The exception contains an error code of RWUTruncatedCharFound in the following situations:
  • If source interrupts an MBCS character sequence started in the previous invocation of convert().
  • If flush is true and source does not complete an MBCS sequence started in the previous invocation of convert().

The target is not modified if an exception is thrown.

◆ convert() [2/7]

void RWUToUnicodeConverter::convert ( const char * source,
RWUString & target,
bool flush = true )

Converts the sequence of bytes contained in the null-terminated source array into an equivalent sequence of UTF-16 code units and appends those code units to the current contents of the target RWUString. The source contents are interpreted according to the character encoding scheme associated with self.

The boolean value flush specifies whether self should be flushed to ensure that any code units stored in the converter's internal state are written to target. The default (true) value explicitly forces a flush and resets the converter to the known default state. This value must be set to true when the current source buffer is the last available chunk of source.

You must also be sure that the source string encodes complete characters, if the output may be flushed, as any saved state and characters would be lost.

Exceptions
RWUException: Thrown if an unhandled conversion error occurs. The exception contains an error code of RWUTruncatedCharFound in the following situations:
  • If source interrupts an MBCS character sequence started in the previous invocation of convert().
  • If flush is true and source does not complete an MBCS sequence started in the previous invocation of convert().

The target is not modified if an exception is thrown.

◆ convert() [3/7]

void RWUToUnicodeConverter::convert ( const char source[],
int32_t size,
RWUString & target,
bool flush = true )

Converts the sequence of bytes contained in the sized source array into an equivalent sequence of UTF-16 code units and appends those code units to the current contents of the target RWUString. size specifies the number of the bytes contained in the array. The source contents are interpreted according to the character encoding scheme associated with this converter. The array may contain embedded nulls.

The boolean value flush specifies whether self should be flushed to ensure that any code units stored in the converter's internal state are written to target. The default (true) value explicitly forces a flush and resets the converter to the known default state. This value must be set to true when the current source buffer is the last available chunk of source.

You must also be sure that the source string encodes complete characters, if the output may be flushed, as any saved state and characters would be lost.

Exceptions
RWUException: Thrown if an unhandled conversion error occurs. The exception contains an error code of RWUTruncatedCharFound in the following situations:
  • If source interrupts an MBCS character sequence started in the previous invocation of convert().
  • If flush is true and source does not complete an MBCS sequence started in the previous invocation of convert().

The target is not modified if an exception is thrown.

◆ convert() [4/7]

void RWUToUnicodeConverter::convert ( const RWCString & source,
RWUString & target,
bool flush = true )

Converts the sequence of bytes contained in the given RWCString container into an equivalent sequence of UTF-16 code units and appends those code units to the current contents of the target RWUString. The source contents are interpreted according to the character encoding scheme associated with self. source may contain embedded nulls.

The boolean value flush specifies whether self should be flushed to ensure that any code units stored in the converter's internal state are written to target. The default (true) value explicitly forces a flush and resets the converter to the known default state. This value must be set to true when the current source buffer is the last available chunk of source.

You must also be sure that the source string encodes complete characters, if the output may be flushed, as any saved state and characters would be lost.

Exceptions
RWUException: Thrown if an unhandled conversion error occurs. The exception contains an error code of RWUTruncatedCharFound in the following situations:
  • If source interrupts an MBCS character sequence started in the previous invocation of convert().
  • If flush is true and source does not complete an MBCS sequence started in the previous invocation of convert().

The target is not modified if an exception is thrown.

◆ convert() [5/7]

void RWUToUnicodeConverter::convert ( const std::string & source,
RWUString & target,
bool flush = true )

Converts the sequence of bytes contained in the string container into an equivalent sequence of UTF-16 code units and appends those code units to the current contents of the target RWUString. The source contents are interpreted according to the character encoding scheme associated with self. The source may contain embedded nulls.

The boolean value flush specifies whether self should be flushed to ensure that any code units stored in the converter's internal state are written to target. The default (true) value explicitly forces a flush and resets the converter to the known default state. This value must be set to true when the current source buffer is the last available chunk of source.

You must also be sure that the source string encodes complete characters, if the output may be flushed, as any saved state and characters would be lost.

◆ convert() [6/7]

bool RWUToUnicodeConverter::convert ( ConverterByteSource source,
RWUChar32 & target,
size_t & currentByteCount,
bool flushAtEnd = true )

Attempts to convert the sequence of bytes produced by functor source into a single code point value and stores that value in target. Returns true if a character was produced; otherwise, false.

Because the number of bytes required to produce a single code point may not be known in advance, a functor is used to supply a sequence of bytes, one byte at a time, until enough bytes have been accumulated to complete a conversion. The functor should return a value of true to indicate that there are more bytes available, and false if there are not.

The number of bytes consumed by the conversion is added to the value in currentByteCount. A non-zero number of bytes may be consumed even if no character value is produced. This can happen when the source sequence ends in a sequence that does not encode a character; for example, a shift sequence.

If flushAtEnd is true, the converter is flushed when the functor can supply no more bytes. To detect a truncated sequence at the end of the source sequence, flushAtEnd must be true. If the functor may be given additional input data, flushAtEnd can be set to false to continue conversion from the previous state.
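
The pull model described above can be sketched in standard C++. This is an illustrative model rather than the library's implementation: decodeOne is a hypothetical function that decodes UTF-8, std::function stands in for ConverterByteSource, std::runtime_error stands in for RWUException, and the behavior shown corresponds to flushAtEnd being true (a sequence cut off by the end of the source is reported as an error).

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>

using ByteSource = std::function<bool(char&)>;  // stand-in for ConverterByteSource

// Hypothetical sketch: pull bytes one at a time until a single UTF-8 code
// point is complete. Every byte pulled is added to byteCount. Returns false
// if the source is exhausted before any byte is read.
bool decodeOne(const ByteSource& source, char32_t& target,
               std::size_t& byteCount) {
    char ch = 0;
    if (!source(ch)) return false;
    ++byteCount;
    const unsigned char lead = static_cast<unsigned char>(ch);
    std::size_t trailing = 0;
    char32_t cp = lead;
    if (lead >= 0x80) {
        if ((lead & 0xE0) == 0xC0)      { trailing = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; }
        else throw std::runtime_error("ill-formed lead byte");
    }
    for (; trailing > 0; --trailing) {
        if (!source(ch)) throw std::runtime_error("truncated sequence");
        ++byteCount;
        const unsigned char c = static_cast<unsigned char>(ch);
        if ((c & 0xC0) != 0x80) throw std::runtime_error("ill-formed trail byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    target = cp;
    return true;
}
```

Because the caller does not know in advance how many bytes a code point needs, the function asks the source for exactly as many as the lead byte announces and accumulates the count in the caller's variable, which makes repeated calls over one source straightforward.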

Exceptions
RWUException: Thrown if a conversion error occurs.

◆ convert() [7/7]

bool RWUToUnicodeConverter::convert ( ConverterByteSource source,
RWUString & target,
size_t & currentByteCount,
size_t maxNumCodePoints = 0,
DelimiterTest delimiterTest = RWUToUnicodeConverter::getWhitespaceDelimiterTest(),
bool ignoreLeadingDelimiters = true,
bool flushAtEnd = true )

Attempts to convert the sequence of bytes produced by functor source into a sequence of code points and appends the result to target. Returns true if a character was produced; otherwise, false.

Because the number of bytes required to produce a single code point may not be known in advance, a functor is used to supply a sequence of bytes, one byte at a time, until enough bytes have been accumulated to complete a conversion. The functor should return a value of true to indicate that there are more bytes available, and false if there are not.

The number of bytes consumed by the conversion is added to the value in currentByteCount. A non-zero number of bytes may be consumed even if no character value is produced. This can happen when the source sequence ends in a sequence that does not encode a character; for example, a shift sequence.

Use maxNumCodePoints to limit the number of code points generated by the conversion. If maxNumCodePoints is zero, the conversion continues until a delimiter is encountered, or the end of the source sequence is reached. The actual number of code points may be greater if ill-formed sequences are encountered while using the Substitute and Skip error response types. In these cases, maxNumCodePoints really specifies the maximum number of code point conversions to attempt; some failed conversions will cause several code points to be emitted. Compare the code point length of target before and after the conversion to determine the actual number of code points produced.

The functor delimiterTest identifies code points that should terminate a conversion. The functor must return true if an RWUChar32 value passed to it is a delimiter; otherwise, false. Any substitution code units produced because the error response type is Substitute or Escape are also subject to the delimiter test. Delimiter code points do not appear in the result; they are discarded. Pass a default, uninitialized DelimiterTest instance to disable the test for delimiters. If no delimiters are defined, the conversion continues until maxNumCodePoints have been read or the end of the source sequence has been reached.

If ignoreLeadingDelimiters is true, any delimiter code points that appear before the first non-delimiter code point are ignored. They are not appended to the result, and are not considered in the total code point count when testing for the maxNumCodePoints limit.

If flushAtEnd is true, the converter is flushed when the functor can supply no more bytes. To detect a truncated sequence at the end of the source sequence, flushAtEnd must be true. If the functor may be given additional input data, flushAtEnd can be set to false to continue conversion from the previous state.

If the current error response type is Substitute, this method appends a substitution character for each truncated multibyte character sequence. When this occurs, the value of the byte immediately following the sequence may be ignored or may be appended to the result. The specific behavior varies according to the source sequence and conversion. If that trailing byte was supposed to start another multibyte sequence, that sequence will also be found to be ill-formed; the converter is now "out-of-sync" with the source stream. The converter is not able to synchronize with the source stream until a single-byte character is consumed. If a multibyte character encoding sequence is truncated by the end of the source sequence when the error response type is Substitute, this method does not produce a substitution character, but instead throws an RWUException. This is not the case when the error response type is Skip.

Exceptions
RWUException: Thrown if a conversion error occurs. Any conversion results produced prior to the exception are appended to target.
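
The interaction of maxNumCodePoints, the delimiter test, and ignoreLeadingDelimiters can be modeled with a short standalone sketch. readToken is a hypothetical function, not the library's implementation; it assumes a single-byte source encoding (ISO-8859-1, where each byte value equals its code point), uses std::function in place of the functor typedefs, and returns true when at least one code point was appended, which is an assumption about the return value.

```cpp
#include <cstddef>
#include <functional>
#include <string>

using ByteSource = std::function<bool(char&)>;        // stand-in for ConverterByteSource
using DelimiterTest = std::function<bool(char32_t)>;  // stand-in for DelimiterTest

// Hypothetical sketch over ISO-8859-1: reads one delimited token from source
// and appends it to target. An empty DelimiterTest disables delimiter checks;
// a maxNumCodePoints of zero means no limit.
bool readToken(const ByteSource& source, std::u16string& target,
               std::size_t& byteCount, std::size_t maxNumCodePoints = 0,
               const DelimiterTest& isDelimiter = DelimiterTest(),
               bool ignoreLeadingDelimiters = true) {
    std::size_t produced = 0;
    char ch = 0;
    while ((maxNumCodePoints == 0 || produced < maxNumCodePoints) && source(ch)) {
        ++byteCount;  // every byte pulled counts, including delimiters
        const char16_t cp = static_cast<unsigned char>(ch);
        if (isDelimiter && isDelimiter(cp)) {
            if (ignoreLeadingDelimiters && produced == 0)
                continue;  // leading delimiter: skipped and not counted
            break;         // delimiter ends the token and is discarded
        }
        target.push_back(cp);
        ++produced;
    }
    return produced > 0;
}
```

Note that the limit is checked before the next byte is pulled, so a call that stops at maxNumCodePoints leaves the following byte in the source for the next call.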

◆ getCodePointDelimiterTest()

static DelimiterTest RWUToUnicodeConverter::getCodePointDelimiterTest ( RWUChar32 delimiter)
static

Returns a DelimiterTest instance that returns true if the code point passed to it is equal to delimiter; otherwise, false. Use with convert(ConverterByteSource, RWUString&, size_t&, size_t, DelimiterTest, bool, bool).

◆ getWhitespaceDelimiterTest()

static DelimiterTest RWUToUnicodeConverter::getWhitespaceDelimiterTest ( )
static

Returns a DelimiterTest instance that returns true if a code point passed to it is a whitespace character, as defined by RWUCharTraits::isWhitespace(); otherwise, false. Use with convert(ConverterByteSource, RWUString&, size_t&, size_t, DelimiterTest, bool, bool).

◆ operator=() [1/2]

RWUToUnicodeConverter & RWUToUnicodeConverter::operator= ( const RWUConverterBase & rhs)
inline

Assignment operator. Makes self a deep copy of rhs. Self uses the same character encoding scheme as rhs, and possesses the same internal state as rhs.

Exercise care when copying converters, especially those used for stateful or multibyte encodings. The new converter may be initialized in a state that causes the converter to produce errors if used to convert a new chunk of text. Consider using reset() to restore the converter to a known state before use.

Exceptions
RWUException: Thrown to indicate that the copy could not be completed because memory could not be allocated for the underlying implementation object.

◆ operator=() [2/2]

RWUToUnicodeConverter & RWUToUnicodeConverter::operator= ( const RWUToUnicodeConverter & rhs)
inline

Assignment operator. Makes self a deep copy of rhs. Self uses the same character encoding scheme as rhs, and possesses the same internal state as rhs.

Exercise care when copying converters, especially those used for stateful or multibyte encodings. The new converter may be initialized in a state that causes the converter to produce errors if used to convert a new chunk of text. Consider using reset() to restore the converter to a known state before use.

Exceptions
RWUException: Thrown to indicate that the copy could not be completed because memory could not be allocated for the underlying implementation object.

◆ reset()

void RWUToUnicodeConverter::reset ( )

Resets self by clearing the internal buffers and restoring the state to a known default state.

◆ restoreErrorResponseState()

void RWUToUnicodeConverter::restoreErrorResponseState ( const ErrorResponseState & state)
inline

Restores the error handling state of the converter from a saved copy. This is the only means of restoring an error response state that existed prior to a call to setErrorResponse(). Use saveErrorResponseState() to save the error response state.

Note
The saved state from one converter may be used to set the state on another converter. However, this operation may not be safe in future versions of the Internationalization Module.

◆ saveErrorResponseState()

RWUToUnicodeConverter::ErrorResponseState RWUToUnicodeConverter::saveErrorResponseState ( ) const
inline

Saves the current error handling state of the converter. This is the only means for saving the current error response state prior to calling setErrorResponse(). Use restoreErrorResponseState() to restore the saved state.

RWUToUnicodeConverter::ErrorResponseState state =
    converter.saveErrorResponseState();
converter.setErrorResponse(RWUToUnicodeConverter::Stop);
converter.restoreErrorResponseState(state);
Note
The saved state from one converter may be used to set the state on another converter. However, this operation may not be safe in future versions of the Internationalization Module.
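
One way to make sure a temporary error response is always undone, even when a conversion throws, is to wrap the save and restore calls in a scope guard. The sketch below is a suggestion rather than part of the library: ErrorResponseGuard is a hypothetical template that relies only on the member names documented here (ErrorResponseType, ErrorResponseState, saveErrorResponseState(), setErrorResponse(), restoreErrorResponseState()) and assumes the saved state can be stored by value. MockConverter exists only so the sketch compiles without SourcePro.

```cpp
// Hypothetical scope guard: saves the error response state on construction,
// applies a temporary response, and restores the saved state on destruction.
template <typename Converter>
class ErrorResponseGuard {
public:
    ErrorResponseGuard(Converter& converter,
                       typename Converter::ErrorResponseType temporary)
        : converter_(converter), saved_(converter.saveErrorResponseState()) {
        converter_.setErrorResponse(temporary);
    }
    ~ErrorResponseGuard() { converter_.restoreErrorResponseState(saved_); }

    ErrorResponseGuard(const ErrorResponseGuard&) = delete;
    ErrorResponseGuard& operator=(const ErrorResponseGuard&) = delete;

private:
    Converter& converter_;
    typename Converter::ErrorResponseState saved_;
};

// Minimal mock used only to exercise the guard; not the SourcePro class.
struct MockConverter {
    enum ErrorResponseType { Stop, Skip, Substitute, Escape };
    struct ErrorResponseState { ErrorResponseType response; };

    ErrorResponseType current = Substitute;

    ErrorResponseState saveErrorResponseState() const { return {current}; }
    void restoreErrorResponseState(const ErrorResponseState& s) { current = s.response; }
    void setErrorResponse(ErrorResponseType r) { current = r; }
};
```

With the mock, declaring ErrorResponseGuard<MockConverter> guard(conv, MockConverter::Stop); inside a block switches the response to Stop for the duration of that block and returns it to Substitute when the block exits.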

◆ setErrorResponse()

void RWUToUnicodeConverter::setErrorResponse ( ErrorResponseType response)

Specifies the action an RWUToUnicodeConverter should take when it encounters an error during the conversion process.

Copyright © 2024 Rogue Wave Software, Inc., a Perforce company. All Rights Reserved.