XML and Character Encodings Concepts

This section briefly introduces some concepts useful in working with HydraExpress and XML documents in various character encodings.

What is a Character Encoding?

A character encoding, or more formally a “coded character set”, is a character set together with the numerical representation of each of its characters.
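
As a concrete illustration of this definition, the following Python sketch shows the same two characters represented under three different encodings. It relies only on Python's built-in codecs and is not HydraExpress code; the sample text is an arbitrary choice.

```python
# Illustration only: uses Python's standard codecs, not the HydraExpress API.
text = "日本"  # arbitrary sample text

# The character set assigns each character a number (its code point).
code_points = [f"U+{ord(ch):04X}" for ch in text]

# An encoding determines how those numbers are represented as bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark
sjis_bytes = text.encode("shift_jis")

print(code_points)
print(utf8_bytes.hex(" "))
print(utf16_bytes.hex(" "))
print(sjis_bytes.hex(" "))
```

For this text, UTF-8 needs three bytes per character, while UTF-16 and Shift_JIS each need two, and the byte values differ in all three cases.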

If your XML document’s character encoding is anything other than UTF-8, you can use the internationalization capabilities of HydraExpress to convert the document to and from your own encoding, so that you can manipulate it in the encoding of your choice.

The related code examples on internationalization refer to the Unicode encoding forms UTF-8 and UTF-16, because HydraExpress uses them internally to manipulate text and to convert XML documents between UTF-8 and other encodings.
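
The general idea of converting an XML document between UTF-8 and another encoding can be sketched in Python as follows. This does not use or describe the HydraExpress API; `transcode_xml` is a hypothetical helper, and the sketch assumes that the document already carries an encoding declaration and that both encodings are known to Python's codec registry.

```python
import re

# Matches the encoding name inside an XML declaration such as
# <?xml version="1.0" encoding="Shift_JIS"?>
_DECL_ENCODING = re.compile(
    r'(<\?xml[^>]*?encoding=)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)

def transcode_xml(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: re-encode an XML document and update the
    encoding name in its declaration so the label matches the bytes."""
    text = data.decode(source)
    text = _DECL_ENCODING.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{target}{m.group(2)}",
        text,
        count=1,
    )
    return text.encode(target)

# Build a small Shift_JIS document, convert it to UTF-8, then back.
sjis_doc = (
    '<?xml version="1.0" encoding="Shift_JIS"?><name>日本</name>'
).encode("shift_jis")
utf8_doc = transcode_xml(sjis_doc, "shift_jis", "UTF-8")
round_trip = transcode_xml(utf8_doc, "utf-8", "Shift_JIS")
```

Rewriting the declaration along with the bytes matters: a document whose declared encoding does not match its actual bytes is likely to be rejected or misread by an XML parser.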

Character Encoding in an XML Prolog

An XML document always starts with a prolog. The prolog describes the contents of the document, including its character encoding. The following prolog contains the mandatory version number and the optional encoding declaration.

<?xml version="1.0" encoding="Shift_JIS"?>

The entire contents of the XML document following the “EncodingDecl” section of the XML prolog must be in the specified character encoding. This includes everything in the message: URIs, end-of-line characters, whitespace, and so on.

For example, in the XML fragment above, all characters following the “?>” must be in the Shift_JIS encoding. For more information on XML declarations, see the XML 1.0 specification at http://www.w3.org/TR/REC-xml#sec-prolog-dtd.
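
To make the role of the encoding declaration concrete, the following Python sketch reads the declared encoding from the start of a document and uses it to decode the rest. It is an illustration rather than HydraExpress code: `declared_encoding` is a hypothetical helper, and it assumes an ASCII-compatible encoding such as Shift_JIS or UTF-8, so that the declaration itself can be read as ASCII bytes. Documents in UTF-16 would need byte order mark detection first, which is omitted here.

```python
import re

# Matches an XML declaration at the very start of the document and
# captures the optional encoding name (group 3).
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*(["\'])1\.[0-9]+\1'
    rb'(?:\s+encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\2)?'
)

def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Hypothetical helper: return the encoding named in the XML
    declaration, or the default when none is declared."""
    match = _XML_DECL.match(data)
    if match and match.group(3):
        return match.group(3).decode("ascii")
    return default

document = (
    '<?xml version="1.0" encoding="Shift_JIS"?>\n<name>日本</name>'
).encode("shift_jis")

encoding = declared_encoding(document)
# Everything after the declaration is decoded with the declared encoding.
text = document.decode(encoding)
```

When no encoding is declared, the helper falls back to UTF-8, which is a simplification of the default behavior defined by XML 1.0 for documents without a byte order mark.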