
I am attempting to store Unicode characters in UTF-8 format in a DB2 database. I have confirmed that the charset is 1208 and that the database is specified to hold UTF-8.

I am, however, getting odd results when querying some unicode data.

select hex(firstname), firstname from my_schema.my_table where my_pk = 1234;

The results are as below:

C383C289    Ã

The character in the result is displayed incorrectly. From what I gather, it is being represented by the hex values "C383C289". The actual character sent on the insert was É, which should be represented in UTF-8 as C389.
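The expected byte sequence can be sanity-checked outside the database; for instance, a quick Python check (illustrative, not part of the original question) confirms what É should look like in UTF-8:

```python
# Encode É (U+00C9) to UTF-8 and inspect the resulting bytes.
encoded = "É".encode("utf-8")
print(encoded.hex().upper())  # C389
```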

At this stage I'm assuming that it could be the program that I am using to query the data that is interpreting it wrong. But to what extent are the hex values (first result column) wrong? They seem to have unused fluff "83C2" between the actual bytes. Or, is "C383C289" actually correct, and some UTF8 decoding engines can't handle the fluff? This seems unlikely to me.

The clients (Toad for DB2 and WinSQL) both display the character as Ã, which is represented in UTF-8 as C383.

Edit: I tested on the CLI and it correctly returns the É character. Am I missing something? Is the "hex" function returning something that it shouldn't be?

Zoran Pavlovic
  • It looks like your insert process is botched somehow. If it properly stored 'É', the hex value would be C383. – mustaccio May 23 '14 at 15:59
  • @mustaccio C383 is the A, not the E. Note, I just tested from the DB2 CLI and it seems that it returns the E, somehow. – Zoran Pavlovic May 23 '14 at 16:58
  • What platform is your DB2 server on? IBM i and z/OS are EBCDIC machines, rather than ASCII based. Support for various character sets is platform dependent. – WarrenT May 23 '14 at 21:30
  • @WarrenT It's on an IBM AIX box. Files, and ssh sessions on the box handle utf8 encoding just fine, and we haven't had an issue with it in that regard. – Zoran Pavlovic May 23 '14 at 22:41

2 Answers


É (U+00C9) in UTF-8 is 0xC3 0x89.

à (U+00C3) in UTF-8 is 0xC3 0x83.

U+0089 (an invisible C1 control character) in UTF-8 is 0xC2 0x89.

This means your insert code is taking É, encoding it to the UTF-8 octets 0xC3 0x89, and then inserting those octets into the DB. The DB is interpreting them as the individual characters U+00C3 and U+0089 and encoding each of them a second time into UTF-8, thus producing 0xC3 0x83 0xC2 0x89.
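The double encode described above can be reproduced in a few lines of Python (a sketch of the mechanism, not the asker's actual insert code):

```python
original = "É"                    # U+00C9
once = original.encode("utf-8")   # b'\xc3\x89' -- the correct UTF-8 form
# Mis-read those bytes as two individual Latin-1 characters...
reread = once.decode("latin-1")   # 'Ã' (U+00C3) followed by U+0089
# ...and encode them into UTF-8 a second time.
twice = reread.encode("utf-8")    # b'\xc3\x83\xc2\x89'
print(twice.hex().upper())        # C383C289 -- matches the hex() output in the question
```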

You need to fix your insert code so it no longer performs that initial encode; then the DB will see the original É as-is rather than a pre-encoded version of it. How you actually do that is anyone's guess, since you have not shown your actual insert code yet.
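If any rows were already stored double-encoded, the damage is reversible by undoing one layer of encoding. A hypothetical repair in Python (this assumes the stored bytes look exactly like the ones in the question, i.e. every affected value was UTF-8-encoded exactly twice):

```python
stored = bytes.fromhex("C383C289")  # what the database actually holds
# Decode the outer UTF-8 layer, then re-encode those characters as
# Latin-1 to recover the original single-encoded UTF-8 bytes.
repaired = stored.decode("utf-8").encode("latin-1")
print(repaired.hex().upper())    # C389
print(repaired.decode("utf-8"))  # É
```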

Remy Lebeau
  • Thanks, that's what was causing the issue. Ironically, it used to work fine when the DB was latin-1 encoded and this issue only crept up on us when we switched to UTF8. – Zoran Pavlovic Jan 19 '15 at 18:13
  • 1
    Right, because Latin-1 encodes most Unicode codepoints as-is for values up to and including U+00FF. For instance, `É` (U+00C9) in Latin-1 is `0xC9`. So the database appeared to be saving exactly what you were giving it. – Remy Lebeau Jan 19 '15 at 19:09
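That point can be checked directly in Python (an illustrative sketch, not from the thread): for É, the Latin-1 byte equals the Unicode code point, while UTF-8 needs two bytes.

```python
ch = "É"
print(hex(ord(ch)))                # 0xc9 -- the Unicode code point U+00C9
print(ch.encode("latin-1").hex())  # c9   -- Latin-1 stores the code point byte as-is
print(ch.encode("utf-8").hex())    # c389 -- UTF-8 uses a two-byte sequence
```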

This is not really an answer, just to demonstrate the correct behaviour:

> db2 "insert into t1 values ('Élan')"
DB20000I  The SQL command completed successfully.
> db2 "select hex(f1), f1 from t1"

1          F1   
---------- -----
C3896C616E Élan 

  1 record(s) selected.
mustaccio