Conversion Errors

A conversion simply maps characters from a source encoding to a target encoding. Normally this is a straightforward process of replacing all the code point values for characters in the source encoding with the code point values for those characters in the target encoding. However, errors can occur in this process. For example, the character being converted may not have a representation in the target encoding, or the code units in the source string may be impossible to interpret as a code point value in the source encoding. When errors such as these occur, the converter can respond in several ways:

stop the conversion process and throw an exception

skip over the offending code units, without appending anything to the output buffer

substitute for the offending code units by appending a specific substitution sequence to the output buffer

escape the offending code units by appending a numeric representation of the code units to the output buffer
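The four responses can be illustrated with a small, self-contained sketch. It does not use the SourcePro API: the `ErrorResponse` enum and the `toAscii()` function are hypothetical names introduced only for this example, which converts UTF-32 code points to 7-bit ASCII and treats every code point above 0x7F as unconvertible. The escape format used here (`%U` followed by hexadecimal digits) is one possible numeric representation, chosen for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical names for illustration only; these are not SourcePro types.
enum class ErrorResponse { Stop, Skip, Substitute, Escape };

// Converts UTF-32 code points to 7-bit ASCII, applying `response` to any
// code point that ASCII cannot represent.
std::string toAscii(const std::u32string& source, ErrorResponse response,
                    const std::string& substitution = "\x1A")
{
    std::string out;
    for (char32_t cp : source) {
        assert(cp <= 0x10FFFF);              // outside the Unicode range
        if (cp < 0x80) {                     // representable: copy through
            out.push_back(static_cast<char>(cp));
            continue;
        }
        switch (response) {
        case ErrorResponse::Stop:            // abort the whole conversion
            throw std::runtime_error("unconvertible code point");
        case ErrorResponse::Skip:            // append nothing
            break;
        case ErrorResponse::Substitute:      // append the substitution sequence
            out += substitution;
            break;
        case ErrorResponse::Escape: {        // append a numeric representation
            // Simplification: formats the code point value itself and does
            // not split values above U+FFFF into surrogate code units.
            char buf[16];
            std::snprintf(buf, sizeof buf, "%%U%04X",
                          static_cast<unsigned>(cp));
            out += buf;
            break;
        }
        }
    }
    return out;
}
```

Calling `toAscii(U"a\u00ACb", ErrorResponse::Skip)` drops the unconvertible character, while `ErrorResponse::Substitute` appends the byte 0x1A in its place and `ErrorResponse::Stop` throws before any result is returned.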

For both RWUToUnicodeConverter and RWUFromUnicodeConverter, the default error-handling response is to substitute for the offending character. RWUToUnicodeConverter uses U+FFFD as its substitution sequence. RWUFromUnicodeConverter uses a substitution sequence appropriate for the target encoding. For example, the substitution sequence for most ASCII-based encodings is 0x1A. You can change the default substitution sequence for a conversion from Unicode by calling RWUFromUnicodeConverter::setSubstitutionSequence().

To change a converter’s error-handling behavior, call RWUToUnicodeConverter::setErrorResponse() or RWUFromUnicodeConverter::setErrorResponse(). Each of these methods accepts an enum value. The set of available enum values depends on the direction of the converter. The function RWUToUnicodeConverter::setErrorResponse() accepts the following enum values:

RWUToUnicodeConverter::Stop

Stops the conversion process on error, and throws an RWUException.

RWUToUnicodeConverter::Skip

Silently skips over any illegal sequences, without writing to the target buffer.

RWUToUnicodeConverter::Substitute

Substitutes illegal sequences with the Unicode substitution character, U+FFFD.

RWUToUnicodeConverter::Escape

Replaces any illegal sequences with a %Xhh escaped hexadecimal representation of the bytes that comprise the illegal sequence; for example, %X09%XA0.
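The difference between the Substitute and Escape responses for a conversion to Unicode can be sketched without the library. The `decodeAscii()` function below is a hypothetical helper, not a SourcePro call. It decodes US-ASCII bytes to UTF-16, treats any byte of 0x80 or above as illegal, and assumes the escape takes the form of a percent sign, an `X`, and two hexadecimal digits per byte.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of SourcePro: decodes US-ASCII bytes to
// UTF-16. Bytes >= 0x80 are illegal in US-ASCII; each one is either escaped
// as %Xhh (escape == true) or replaced with U+FFFD (escape == false).
std::u16string decodeAscii(const std::string& bytes, bool escape)
{
    std::u16string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char16_t>(b));   // legal byte
        } else if (escape) {
            char buf[8];
            int n = std::snprintf(buf, sizeof buf, "%%X%02X",
                                  static_cast<unsigned>(b));
            assert(n == 4);                  // '%', 'X', and two hex digits
            for (int i = 0; i < n; ++i)
                out.push_back(static_cast<char16_t>(buf[i]));
        } else {
            out.push_back(u'\uFFFD');        // Unicode replacement character
        }
    }
    return out;
}
```

With substitution, information about the offending byte is lost; with escaping, the byte value remains visible in the output and can be inspected later.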

The function RWUFromUnicodeConverter::setErrorResponse() provides a similar set of error-handling tactics, but supports a wider variety of escaping options to facilitate working with different target encodings:

RWUFromUnicodeConverter::Stop

Stops the conversion process on error, and throws an RWUException.

RWUFromUnicodeConverter::Skip

Silently skips over any illegal sequences, without writing to the target buffer.

RWUFromUnicodeConverter::Substitute

Substitutes illegal sequences with the current substitution sequence. The default substitution sequence depends on the target encoding. For ASCII-based encodings, the default substitution sequence is 0x1A. The setSubstitutionSequence() method allows you to specify the substitution sequence.

RWUFromUnicodeConverter::EscapeNativeHexadecimal

Replaces illegal sequences with a %UX escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, %UFFFE%U00AC. Note that a code point represented by a surrogate pair is escaped as two hexadecimal values; for example, %UD84D%UDC56. If the target encoding does not support the characters {U,%}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.

RWUFromUnicodeConverter::EscapeJavaHexadecimal

Replaces illegal sequences with a \uX escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, \uFFFE\u00AC. Note that a code point represented by a surrogate pair is escaped as two hexadecimal values; for example, \uD84D\uDC56. If the target encoding does not support the characters {u,\}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.

RWUFromUnicodeConverter::EscapeCHexadecimal

Replaces illegal sequences with a \uX escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, \uFFFE\u00AC. Note that a code point represented by a surrogate pair is escaped as a single eight-digit hexadecimal value introduced by \U; for example, \U00023456. If the target encoding does not support the characters {u,\}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.

RWUFromUnicodeConverter::EscapeXmlDecimal

Replaces illegal sequences with a &#DDDD; escaped decimal representation of the code units that comprise the illegal sequence; for example, &#172;. Note that a code point represented by a surrogate pair is escaped as a single decimal value without zero padding; for example, &#144470;. If the target encoding does not support the characters {&,#,;}[0-9], an illegal sequence is replaced by the substitution sequence.

RWUFromUnicodeConverter::EscapeXmlHexadecimal

Replaces illegal sequences with a &#xXXXX; escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, &#xAC;. Note that a code point represented by a surrogate pair is escaped as a single hexadecimal value without zero padding; for example, &#x12345;. If the target encoding does not support the characters {&,#,x,;}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.
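The escape formats differ mainly in how they treat a code point outside the Basic Multilingual Plane: the native and Java forms escape each UTF-16 code unit of the surrogate pair, while the C and XML forms escape the code point as one value. The sketch below reproduces that arithmetic in plain C++. The `Escape` enum and `escapeCodePoint()` are hypothetical names, not SourcePro API. The C form assumes the standard C convention of `\U` plus eight hexadecimal digits for values above U+FFFF, and the XML forms assume standard numeric character references.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical names for illustration only; these are not SourcePro types.
enum class Escape { NativeHex, JavaHex, CHex, XmlDecimal, XmlHex };

std::string escapeCodePoint(char32_t cp, Escape style)
{
    assert(cp <= 0x10FFFF);
    const unsigned v = static_cast<unsigned>(cp);

    // UTF-16 code units: one for a BMP code point, otherwise a surrogate pair.
    unsigned units[2] = { v, 0 };
    int count = 1;
    if (v > 0xFFFF) {
        units[0] = 0xD800 + ((v - 0x10000) >> 10);    // high surrogate
        units[1] = 0xDC00 + ((v - 0x10000) & 0x3FF);  // low surrogate
        count = 2;
    }

    char buf[32];
    std::string out;
    switch (style) {
    case Escape::NativeHex:                  // one %UXXXX per code unit
        for (int i = 0; i < count; ++i) {
            std::snprintf(buf, sizeof buf, "%%U%04X", units[i]);
            out += buf;
        }
        break;
    case Escape::JavaHex:                    // one \uXXXX per code unit
        for (int i = 0; i < count; ++i) {
            std::snprintf(buf, sizeof buf, "\\u%04X", units[i]);
            out += buf;
        }
        break;
    case Escape::CHex:                       // one value per code point
        if (count == 1)
            std::snprintf(buf, sizeof buf, "\\u%04X", v);
        else
            std::snprintf(buf, sizeof buf, "\\U%08X", v);
        out = buf;
        break;
    case Escape::XmlDecimal:                 // decimal, no zero padding
        std::snprintf(buf, sizeof buf, "&#%u;", v);
        out = buf;
        break;
    case Escape::XmlHex:                     // hexadecimal, no zero padding
        std::snprintf(buf, sizeof buf, "&#x%X;", v);
        out = buf;
        break;
    }
    return out;
}
```

For U+23456, whose UTF-16 encoding is the surrogate pair D84D DC56, the native form yields `%UD84D%UDC56`, the C form yields `\U00023456`, and the XML decimal form yields `&#144470;`.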