Conversion Errors
A conversion simply maps characters from a source encoding to a target encoding. Normally this is a straightforward process of replacing all the code point values for characters in the source encoding with the code point values for those characters in the target encoding. However, errors can occur in this process. For example, the character being converted may not have a representation in the target encoding, or the code units in the source string may be impossible to interpret as a code point value in the source encoding. When errors such as these occur, the converter can respond in several ways:
stop the conversion process and throw an exception
skip over the offending code units, without appending anything to the output buffer
substitute for the offending code units by appending a specific substitution sequence to the output buffer
escape the offending code units by appending a numeric representation of the code units to the output buffer
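The two kinds of failure described above can be told apart mechanically. The following is a minimal sketch, assuming a conversion from UTF-16 to ISO-8859-1; the names `Issue` and `firstIssue` are hypothetical and are not part of the converter classes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical classification of the first problem found when converting
// UTF-16 text to ISO-8859-1. These names are illustrative only.
enum class Issue { None, Unmappable, Malformed };

static bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Scans src from left to right and reports the first conversion problem.
Issue firstIssue(const std::u16string& src) {
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char16_t u = src[i];
        if (isHighSurrogate(u)) {
            const bool paired = i + 1 < src.size() && isLowSurrogate(src[i + 1]);
            // A well-formed pair is a code point above U+FFFF, which has no
            // ISO-8859-1 representation. A lone high surrogate cannot be
            // interpreted as a code point at all.
            return paired ? Issue::Unmappable : Issue::Malformed;
        }
        if (isLowSurrogate(u)) return Issue::Malformed;  // no preceding high surrogate
        if (u > 0xFF) return Issue::Unmappable;          // outside the ISO-8859-1 range
    }
    return Issue::None;
}
```

For instance, `firstIssue(u"\u20AC")` reports `Issue::Unmappable` because the euro sign has no ISO-8859-1 code point, whereas a string holding only the code unit 0xD84D reports `Issue::Malformed`.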
For both RWUToUnicodeConverter and RWUFromUnicodeConverter, the default error-handling response is to substitute for the offending character. RWUToUnicodeConverter uses U+FFFD as its substitution sequence. RWUFromUnicodeConverter uses a substitution sequence appropriate for the target encoding. For example, the substitution sequence for most ASCII-based encodings is 0x1A. You can change the default substitution sequence for a conversion from Unicode by calling RWUFromUnicodeConverter::setSubstitutionSequence().
To change a converter’s error-handling behavior, call RWUToUnicodeConverter::setErrorResponse() or RWUFromUnicodeConverter::setErrorResponse(). Each of these methods accepts an enum value. The set of available enum values depends on the direction of the converter. The function RWUToUnicodeConverter::setErrorResponse() accepts the following enum values:
RWUToUnicodeConverter::Stop Stops the conversion process on error, and throws an RWUException.
RWUToUnicodeConverter::Skip Silently skips over any illegal sequences, without writing to the target buffer.
RWUToUnicodeConverter::Substitute Substitutes illegal sequences with the Unicode substitution character, U+FFFD.
RWUToUnicodeConverter::Escape Replaces any illegal sequences with an Xhh escaped hexadecimal representation of the bytes that comprise the illegal sequence; for example, X09XA0.
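To make the four responses concrete, here is a minimal sketch of a toy decoder from 7-bit US-ASCII to UTF-16. It does not use the converter classes: `Response` and `decodeAscii` are hypothetical stand-ins, and the escape branch simply follows the Xhh form quoted in the list above.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the four responses; the real enumerators are
// RWUToUnicodeConverter::Stop, Skip, Substitute, and Escape.
enum class Response { Stop, Skip, Substitute, Escape };

// Decodes 7-bit US-ASCII bytes to UTF-16. Any byte >= 0x80 is illegal in
// US-ASCII and is handled according to `response`.
std::u16string decodeAscii(const std::string& bytes, Response response) {
    static const char16_t hex[] = u"0123456789ABCDEF";
    std::u16string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {                       // legal byte: same value in Unicode
            out.push_back(static_cast<char16_t>(b));
            continue;
        }
        switch (response) {
        case Response::Stop:                  // abandon the conversion
            throw std::runtime_error("illegal byte in US-ASCII input");
        case Response::Skip:                  // write nothing for this byte
            break;
        case Response::Substitute:            // Unicode replacement character
            out.push_back(char16_t(0xFFFD));
            break;
        case Response::Escape:                // 'X' plus two hex digits per byte
            out.push_back(u'X');
            out.push_back(hex[b >> 4]);
            out.push_back(hex[b & 0x0F]);
            break;
        }
    }
    return out;
}
```

In this sketch, decoding the bytes 0x61 0xA0 0x62 produces "ab" with `Skip`, "a", U+FFFD, "b" with `Substitute`, and "aXA0b" with `Escape`; `Stop` throws at the second byte.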
The function RWUFromUnicodeConverter::setErrorResponse() provides a similar set of error-handling tactics, but supports a wider variety of escaping options to facilitate working with different target encodings:
RWUFromUnicodeConverter::Stop Stops the conversion process on error, and throws an RWUException.
RWUFromUnicodeConverter::Skip Silently skips over any illegal sequences, without writing to the target buffer.
RWUFromUnicodeConverter::Substitute Substitutes illegal sequences with the current substitution sequence. The default substitution sequence depends on the target encoding. For ASCII-based encodings, the default substitution sequence is 0x1A. The setSubstitutionSequence() method allows you to specify the substitution sequence.
RWUFromUnicodeConverter::EscapeNativeHexadecimal Replaces illegal sequences with a %UX escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, %UFFFE%U00AC. Note that a code point represented by a surrogate pair is escaped as two hexadecimal values; for example, %UD84D%UDC56. If the target encoding does not support the characters {U,%}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.
RWUFromUnicodeConverter::EscapeJavaHexadecimal Replaces illegal sequences with a \uX escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, \uFFFE\u00AC. Note that a code point represented by a surrogate pair is escaped as two hexadecimal values; for example, \uD84D\uDC56. If the target encoding does not support the characters {u,\}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.
RWUFromUnicodeConverter::EscapeCHexadecimal Replaces illegal sequences with a \uX escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, \uFFFE\u00AC. Note that a code point represented by a surrogate pair is escaped as a single hexadecimal value; for example, \u00023456. If the target encoding does not support the characters {u,\}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.
RWUFromUnicodeConverter::EscapeXmlDecimal Replaces illegal sequences with a &#DDDD; escaped decimal representation of the code units that comprise the illegal sequence; for example, &#172;. Note that a code point represented by a surrogate pair is escaped as a single decimal value without zero padding; for example, &#144470;. If the target encoding does not support the characters {&,#,;}[0-9], an illegal sequence is replaced by the substitution sequence.
RWUFromUnicodeConverter::EscapeXmlHexadecimal Replaces illegal sequences with a &#xXXXX; escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, &#xAC;. Note that a code point represented by a surrogate pair is escaped as a single hexadecimal value without zero padding; for example, &#x12345;. If the target encoding does not support the characters {&,#,x,;}[0-9], an illegal sequence is replaced by the substitution sequence.
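The difference between the five escape styles is easiest to see side by side. The sketch below formats a single code point in each style, following the examples in the list above. It is illustrative only: `Escape`, `toDigits`, and `escapeCodePoint` are hypothetical names, the fallback to the substitution sequence is not modeled, and the absence of zero padding for XML escapes of values below U+10000 is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names mirroring the RWUFromUnicodeConverter escape responses.
enum class Escape { NativeHex, JavaHex, CHex, XmlDecimal, XmlHex };

// Renders value in the given base (10 or 16), left-padded with zeros.
static std::string toDigits(std::uint32_t value, unsigned base, std::size_t minWidth) {
    static const char digits[] = "0123456789ABCDEF";
    std::string s;
    do {
        s.insert(s.begin(), digits[value % base]);
        value /= base;
    } while (value != 0);
    while (s.size() < minWidth) s.insert(s.begin(), '0');
    return s;
}

// Returns the escape text for one code point (cp <= 0x10FFFF).
std::string escapeCodePoint(std::uint32_t cp, Escape style) {
    const bool pair = cp > 0xFFFF;  // needs a surrogate pair in UTF-16
    const std::uint32_t hi = pair ? 0xD800 + ((cp - 0x10000) >> 10) : cp;
    const std::uint32_t lo = pair ? 0xDC00 + ((cp - 0x10000) & 0x3FF) : 0;

    // One four-digit escape per UTF-16 code unit, each introduced by prefix.
    auto perUnit = [&](const std::string& prefix) {
        std::string s = prefix + toDigits(hi, 16, 4);
        if (pair) s += prefix + toDigits(lo, 16, 4);
        return s;
    };

    switch (style) {
    case Escape::NativeHex:  return perUnit("%U");
    case Escape::JavaHex:    return perUnit("\\u");
    // Single value; eight digits for a surrogate pair, as in the text's example.
    case Escape::CHex:       return "\\u" + toDigits(cp, 16, pair ? 8 : 4);
    case Escape::XmlDecimal: return "&#" + toDigits(cp, 10, 0) + ";";
    case Escape::XmlHex:     return "&#x" + toDigits(cp, 16, 0) + ";";
    }
    return std::string();
}
```

For U+23456, this sketch yields %UD84D%UDC56, \uD84D\uDC56, \u00023456, &#144470;, and &#x23456; for the five styles respectively.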