11.2 Character Sets

Most relational databases were developed in environments where the primary language was English. In these environments, database servers stored character data in some variant of the CHAR or VARCHAR datatypes. As database vendors expanded beyond the English-speaking markets, demand increased for different native character sets. In response, the NCHAR and NVARCHAR datatypes were created for holding character data in national character sets. In this chapter, we use the terms:

standard character set datatypes, to mean the original CHAR or VARCHAR datatypes;

national character set datatypes, to mean the newer NCHAR and NVARCHAR datatypes.

Unfortunately, database vendors did not standardize on a common set of features and capabilities for these new datatypes. Some databases implement national character set support in their standard character datatypes and use NCHAR and NVARCHAR as synonyms. Other vendors implement the datatypes identically except for the collation sequencing capabilities. Still others use completely separate implementations for standard and national character set datatypes. The documentation provided by your database vendor should help you identify the vendor's implementation technique.

DBTools.h++ is designed to make the differences between database implementations nearly invisible, but some differences do persist. Please consult the internationalization section of your DBTools.h++ access library documentation to learn about the behavior differences.

In the examples in this chapter, we use Chinese characters that represent something similar to "hello, world." These characters were selected from the UNICODE standard. In order for these examples to run properly, the machine must have the UNICODE character set installed.

11.2.1 National Character Sets and C++ Datatypes

DBTools.h++ uses three different C++ classes to hold character string data:

RWCString is used for standard ASCII strings.

RWDBMBString is used for multibyte character sets, such as UTF-8.

RWWString is used for wide character strings, such as UCS-2 or UCS-4.
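
To make the difference between the multibyte and wide forms concrete, the sketch below decodes a multibyte string into the code points that a wide string would hold. It is plain standard C++ rather than DBTools.h++ code, it assumes the multibyte character set is UTF-8, and the function name decodeUtf8 is hypothetical. For brevity it does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Returns an empty vector on an invalid lead byte, an invalid
// continuation byte, or a truncated sequence.
// Assumption: the input is UTF-8; other multibyte sets need other rules.
std::vector<unsigned long> decodeUtf8(const std::string& bytes)
{
    std::vector<unsigned long> points;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;          // number of continuation bytes
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return std::vector<unsigned long>();   // invalid lead byte

        if (extra > 0 && i + extra >= bytes.size())
            return std::vector<unsigned long>();    // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80)
                return std::vector<unsigned long>(); // bad continuation byte
            cp = (cp << 6) | (next & 0x3F);
        }
        points.push_back(cp);
        i += extra + 1;
    }
    return points;
}
```

Applied to the six bytes "\346\202\250\345\245\275" used in this chapter's examples, the function yields the two code points U+60A8 and U+597D.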

Experienced users of Tools.h++ may point out that the RWCString class is also capable of storing multibyte character strings. While this is true, DBTools.h++ users are encouraged to use RWDBMBString for multibyte strings instead. Because some databases differentiate between multibyte and standard ASCII strings, applications using RWDBMBString for multibyte character strings maximize portability to other databases.

NOTE: You are encouraged to use RWDBMBString rather than RWCString for storing multibyte character strings.

The actual character sets used by a given system depend on several aspects of the hardware and software installation. When an operating system is installed on a machine, a character set is selected to represent the keyboard attached to the machine as well as some possible supplementary character sets. A database also has at least one character set associated with the server and one with the client.

It is important to ensure compatibility between the default character set of the operating system and the character set of the client database software. DBTools.h++ does not implement translations between character sets, but it does forward a translation request to the underlying operating system for translations between wide and multibyte strings. If the operating system's multibyte character set is incompatible with the multibyte character set expected by a database's client software, translated strings can reach the database corrupted or be misinterpreted.

From the standpoint of DBTools.h++, the character set on the database server is irrelevant. It is the responsibility of the database software to translate between the server and client character sets. It is the responsibility of the system administrator to ensure that this mapping of character sets is working properly.

NOTE: Incompatibility between the multibyte character set used by the operating system and the multibyte character set expected by database client software causes problems. It is the responsibility of your system administrator to ensure compatibility.

Having made our case, let's now look in more detail at the three C++ classes that DBTools.h++ provides to hold character string data.

11.2.2 RWWString

Class RWWString is for wide character strings. This class is taken from the Tools.h++ class library. While RWWString can hold values from any wide character set, how those values are interpreted depends largely on your operating system. If an instance of this class is used for print or screen output, the output device must understand the character set.

Note that wide character strings are rarely serialized; multibyte strings are used for serialization. To send a wide character string to a database, it must be translated into multibyte form. The operating system provides calls that translate a wide string into a multibyte string. DBTools.h++ automatically employs the translation system calls to send a wide string to a server.
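
The operating system translation described above is usually reachable through the standard C library function wcstombs, which converts according to the current locale. The following is a minimal sketch in plain standard C++, not the code DBTools.h++ itself uses; the helper name wideToMultiByte is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string to the current locale's multibyte encoding.
// Returns false if some character has no multibyte representation
// in that locale. Conversion stops at the first embedded null.
bool wideToMultiByte(const std::wstring& wide, std::string& out)
{
    // Worst case: every wide character needs MB_CUR_MAX bytes,
    // plus one byte for the terminating null.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);

    const std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return false;               // unconvertible character encountered

    out.assign(buf.data(), n);      // n excludes the terminating null
    return true;
}
```

In the default "C" locale only ASCII is guaranteed to convert. An application would normally call setlocale(LC_ALL, "") at startup so that the operating system's own multibyte character set is used, which is why that character set must match what the database client software expects.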

Class RWWString may be used both for sending data to a database client and for holding the data that is fetched. When using a wide string with an RWDBInserter, the destination column must be able to handle characters from a national character set. Usually this would be NCHAR or NVARCHAR, but for some databases it could be other CHAR or VARCHAR variants. Similarly, when an RWWString is used in an expression with class RWDBExpr, it should be used where a national string makes sense semantically.

The following example shows the use of RWWString with both the RWDBInserter class and the RWDBSelector class. Assume that the table, t1, consists of one column of type NCHAR:

void
showSQLUsingWideStrings (RWDBDatabase& aDB)
{
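  // The octal escapes are the multibyte (here assumed UTF-8) bytes of
  // the Chinese greeting; RWWString::multiByte requests conversion
  // from multibyte to wide characters.
  RWWString wstring("\346\202\250\345\245\275",
                    RWWString::multiByte);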

  RWDBTable t1 = aDB.table("t1");
  RWDBInserter ins = t1.inserter();
  ins << wstring;
  cout << ins.asString() << endl;

  RWDBSelector sel = aDB.selector();
  sel << t1;
  sel.where (t1["a"] == wstring);
  cout << sel.asString() << endl;
}

The output of this demonstration routine is two SQL statements, an INSERT statement followed by a SELECT statement, of the form:

INSERT INTO t1 VALUES ('您好')
SELECT * FROM t1 WHERE t1.a = '您好'

In both cases, the actual strings reflect the dialect of SQL understood by the server, which is represented by aDB. The wide character string is converted into a multibyte string for transmitting to the database. The quoting technique required by that server for multibyte strings is placed around the multibyte string.

RWWString cannot be used directly with instances of RWDBBoundExpr. The best way to proceed is to bind an instance of RWDBMBString in its place, then use the toMultiByte member function of RWWString to refresh the value in the RWDBMBString as needed. The following example demonstrates how:

RWWString
getAWideStringFromSomewhere ()
{
  return RWWString ("\346\202\250\345\245\275",
                    RWWString::multiByte);
}

RWBoolean
moreInputToProcess ()
{
  return FALSE;
}

void
bindWideStringsDemo (RWDBDatabase& aDB)
{
  RWDBTable t1 = aDB.table("t1");

  RWWString wstring;
  RWDBMBString boundString;

  RWDBInserter ins = t1.inserter();
  ins << RWDBBoundExpr(&boundString);

  do {
    wstring = getAWideStringFromSomewhere();
    boundString = wstring.toMultiByte();
    ins.execute();
  } while (moreInputToProcess());
}
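
The reason the loop above works is that a bound expression holds the address of the application's variable rather than a copy of its value, so each execution picks up whatever the variable holds at that moment. The sketch below imitates this in plain standard C++. MockBoundInserter and insertAll are hypothetical stand-ins invented for illustration and are not part of DBTools.h++.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an inserter with one bound parameter.
// It stores the ADDRESS of the caller's buffer, not a copy of its value.
class MockBoundInserter {
public:
    explicit MockBoundInserter(const std::string* bound) : bound_(bound) {}

    // Reads whatever the bound buffer holds at the moment of execution.
    void execute() { rows_.push_back(*bound_); }

    const std::vector<std::string>& rows() const { return rows_; }

private:
    const std::string* bound_;
    std::vector<std::string> rows_;
};

// Bind once, then refresh the bound buffer before each execute().
std::vector<std::string> insertAll(const std::vector<std::string>& inputs)
{
    std::string boundString;               // plays the role of the bound multibyte string
    MockBoundInserter ins(&boundString);   // bind before the loop
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        boundString = inputs[i];           // refresh the value in place
        ins.execute();                     // sees the refreshed value
    }
    return ins.rows();
}
```

Because the binding is by address, the bound variable must outlive every call to execute().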

RWWString is also an appropriate class to use when fetching data from a database. When you use an RWDBReader, the datatype of the original column doesn't matter. Any column type may be fetched as a wide string. If the original column type is an INTEGER type, the returned integer is converted to an ASCII string and that ASCII string is widened to fit an RWWString. The same is true for all other datatypes.
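
The two-step conversion described above, number to ASCII text and ASCII text to wide text, can be sketched in plain standard C++ as follows. The helper names widenAscii and integerAsWide are hypothetical, and the exact formatting DBTools.h++ applies to non-string datatypes may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Widen a 7-bit ASCII string by copying each byte into one wide character.
std::wstring widenAscii(const std::string& ascii)
{
    std::wstring wide;
    wide.reserve(ascii.size());
    for (std::size_t i = 0; i < ascii.size(); ++i)
        wide.push_back(static_cast<wchar_t>(
            static_cast<unsigned char>(ascii[i])));
    return wide;
}

// Step 1: format the integer as ASCII text. Step 2: widen that text.
std::wstring integerAsWide(long value)
{
    return widenAscii(std::to_string(value));
}
```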

The use of RWWString with cursors is a little more restrictive. The originating column must be of a type appropriate for national character strings. The NCHAR and NVARCHAR datatypes are always acceptable; for many databases, CHAR and VARCHAR variants also work.

The next example demonstrates how to use an RWWString to fetch data from a database with an RWDBCursor and an RWDBReader:

void
getWideStrings (RWDBDatabase& aDB)
{
  RWDBTable t1 = aDB.table("t1");

  RWWString wstring;

  {
    cout << "t1 using a cursor" << endl;
    RWDBCursor cur = t1.cursor();
    cur << &wstring;
    while (cur.fetchRow().isValid())
      cout << wstring << endl;
  }

  {
    cout << "t1 using a reader" << endl;
    RWDBReader rdr = t1.reader();
    while (rdr()) {
      rdr >> wstring;
      cout << wstring << endl;
    }
  }
}

If the display device used understands the character sets involved and the table t1 contains the single row of our Chinese hello example, the output of this sample program should look something like this:

t1 using a cursor
您好
t1 using a reader
您好

Finally, RWWString can be used with RWDBStoredProc in the same manner as with other datatypes. There are no special restrictions.

11.2.3 RWDBMBString

The main purpose of class RWDBMBString is to assist DBTools.h++ in differentiating between plain ASCII strings and multibyte character set strings. For example, some databases may require quotation marks around national character set strings when inserted into or compared with national character set columns (NCHAR, NVARCHAR). If all strings are stored in RWCString instances, DBTools.h++ cannot determine which strings need special quotes and which do not.
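
The value of a distinct string type can be sketched with ordinary overloading: once multibyte strings have their own type, formatting code can choose the quoting style from the type alone. The example is plain standard C++. The MBString type and the sqlLiteral and escapeQuotes functions are hypothetical, and the N'...' form is only one dialect's convention for national character literals, used here as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical marker type: a multibyte string that is still a std::string,
// much as RWDBMBString is still an RWCString.
struct MBString : std::string {
    explicit MBString(const std::string& s) : std::string(s) {}
};

// Double every embedded single quote, the usual SQL escaping rule.
std::string escapeQuotes(const std::string& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        out += s[i];
        if (s[i] == '\'')
            out += '\'';
    }
    return out;
}

// Standard character string: plain quotes.
std::string sqlLiteral(const std::string& s)
{
    return "'" + escapeQuotes(s) + "'";
}

// Multibyte string: national-literal quotes (assumed dialect convention).
std::string sqlLiteral(const MBString& s)
{
    return "N'" + escapeQuotes(s) + "'";
}
```

If both kinds of text were held in the same type, the second overload could never be selected, which is the ambiguity described above.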

Some databases require different treatment for standard character strings and national character strings. For this reason, it is the application programmer's responsibility to treat standard character string columns and national character string columns as different types.

If multibyte strings and national character columns are used in a DBTools.h++ application, using RWDBMBString as the implementation of the multibyte strings ensures maximum cross-database portability.

You should use RWDBMBString like any other datatype. There are no special restrictions on its use.

RWDBMBString is a direct subclass of RWCString from the Tools.h++ library. All the member functions that are available in RWCString are also available in RWDBMBString. This includes the use of regular expression and substring classes.

11.2.4 RWCString

RWCStrings can be used to hold and manipulate multibyte strings. DBTools.h++ allows their use even when associated with national character set columns, although it may not format them properly since RWCString is also the default class associated with standard character string columns. It is recommended that DBTools.h++ application programs use RWDBMBStrings for national character set columns and RWCString for standard character set columns.

11.2.5 Data Definition Language

The DBTools.h++ encapsulation of Data Definition Language (DDL) allows the creation of national character set columns when defining new tables. The public enum ValueType, from the RWDBValue class, is used by the RWDBColumn class to specify the type of a column. When creating a table, specifying a column type of RWDBValue::MBString or RWDBValue::WString results in a national character set column type. The result is the same no matter which one is used.
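
The mapping described above can be pictured as a small lookup from value type to column type. The sketch below is plain standard C++ with a hypothetical ValueType enum and columnTypeFor function; the NVARCHAR and VARCHAR names are assumptions for one imagined dialect, and the column types an access library actually generates are vendor-specific.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of three of the value types named in the text.
enum ValueType { String, MBString, WString };

// Choose a column type for one imagined SQL dialect.
std::string columnTypeFor(ValueType type, unsigned length)
{
    const std::string len = "(" + std::to_string(length) + ")";
    switch (type) {
    case MBString:
    case WString:
        return "NVARCHAR" + len;   // both yield the same national type
    case String:
        break;
    }
    return "VARCHAR" + len;        // standard character set column
}
```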

Some databases allow the names of tables, views, stored procedures, and indices to use national character sets. DBTools.h++ allows the use of multibyte character strings in these places. If special quoting is needed by a particular database, however, DBTools.h++ does not automatically provide it. It is the application programmer's responsibility to embed any special quotation marks around a string to be used as an identifier name.
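
As an illustration of quoting an identifier in application code, the sketch below wraps a name in double quotes in the style of ANSI SQL delimited identifiers. It is plain standard C++ and the function name quoteIdentifier is hypothetical. It assumes a database that accepts double-quoted identifiers and a multibyte encoding such as UTF-8 in which the double-quote byte never occurs inside a multibyte character; some databases use brackets or backquotes instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Wrap an identifier in double quotes, doubling any embedded double quote.
// Assumption: the target database accepts ANSI-style delimited identifiers.
std::string quoteIdentifier(const std::string& name)
{
    std::string quoted = "\"";
    for (std::size_t i = 0; i < name.size(); ++i) {
        quoted += name[i];
        if (name[i] == '"')
            quoted += '"';
    }
    quoted += '"';
    return quoted;
}
```

The quoted result, rather than the bare name, is then what the application passes wherever the identifier is expected.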

11.2.6 Using Schema Information from Result Tables

While some databases require special treatment for the standard and multibyte strings sent to servers, most do not differentiate when returning standard and multibyte strings. This has some interesting ramifications. For most databases, DBTools.h++ cannot determine if a column comes from a standard or national character set column. By default, DBTools.h++ must identify the type as RWDBValue::String. Consider this example:

void
demoReturnTypeConundrum (RWDBDatabase& aDB)
{
  RWDBSchema aSchema;
  aSchema.appendColumn("a", RWDBValue::MBString, 20);
  aDB.createTable ("t1", aSchema);

  RWDBTable t1 = aDB.table("t1");
  {
    RWDBInserter ins = t1.inserter();
    // Multibyte (here assumed UTF-8) bytes of the chapter's Chinese greeting
    ins << RWDBMBString("\346\202\250\345\245\275");
    ins.execute();
  }

  RWDBReader rdr = t1.reader();
  RWDBTable resultTable = rdr.table();
  RWDBValue::ValueType type = resultTable.column(0).type();
  if (type == RWDBValue::String)
    cout << "This database returns String" << endl;
  else if (type == RWDBValue::MBString)
    cout << "This database returns MBString" << endl;
  else if (type == RWDBValue::WString)
    cout << "This database returns WString" << endl;
  else
    cout << "This database returned something really weird"
         << endl;
}

In this example, a table is created with one national character set column. When data is retrieved from the column, however, most databases report its type as simply RWDBValue::String. You must therefore exercise care when creating new tables using the schema from result tables, because a national character set column may be recreated as a standard character set column.

If the database does not differentiate between standard character set strings and national character set strings, the output should look like this:

This database returns String
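
One way to exercise the care called for above is to correct the column types before reusing a result-table schema. The sketch below shows the idea with plain standard C++ structures. ValueType, ColumnInfo, and restoreNationalTypes are hypothetical stand-ins rather than DBTools.h++ classes, and the sketch assumes the application already knows which columns are national character set columns.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical mirror of three of the value types named in the text.
enum ValueType { String, MBString, WString };

// Hypothetical stand-in for one column of a schema.
struct ColumnInfo {
    std::string name;
    ValueType   type;
    unsigned    length;
};

// Copy a result-table schema, restoring the national type for every
// column the application knows to be a national character set column.
std::vector<ColumnInfo> restoreNationalTypes(
    const std::vector<ColumnInfo>& resultSchema,
    const std::set<std::string>& nationalColumns)
{
    std::vector<ColumnInfo> fixed(resultSchema);
    for (std::size_t i = 0; i < fixed.size(); ++i) {
        if (fixed[i].type == String &&
            nationalColumns.count(fixed[i].name) != 0)
            fixed[i].type = MBString;
    }
    return fixed;
}
```

The corrected copy, not the schema as reported by the result table, would then be the basis for creating the new table.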