2

It can't get correct value to query Chinese value from latin1 field (varchar or char) Using JDBC for MySQL. The character can't be changed. The following is the test steps. Is it possible to get the correct value for this situation? What's more should I do?

1. create database:

CREATE DATABASE TESTDB CHARACTER SET latin1 COLLATE latin1_general_ci;  

2. create table:

CREATE TABLE TB1 (
    vname varchar(50) default '',
    cname char(50) default ''
) DEFAULT CHARSET=latin1 ;

3. insert data:

set names latin1;  
insert into TB1(vname,cname) values('名字v','名字c');  

4. show character set:

mysql> show variables like '%set%';  
+--------------------------+---------------------------------------------------------+  
| Variable_name            | Value                                                   |  
+--------------------------+---------------------------------------------------------+  
| auto_increment_offset    | 1                                                       |  
| character_set_client     | latin1                                                  |  
| character_set_connection | latin1                                                  |  
| character_set_database   | latin1                                                  |  
| character_set_filesystem | binary                                                  |  
| character_set_results    | latin1                                                  |  
| character_set_server     | utf8                                                    |  
| character_set_system     | utf8                                                    |  
| character_sets_dir       | F:\Program Files\MySQL\MySQL Server 5.5\share\charsets\ |  
+--------------------------+---------------------------------------------------------+  

5. it can get correct value when querying from command console:

mysql> select * from tb1;  
+-------+-------+  
| vname | cname |  
+-------+-------+  
| 名字v     | 名字c     |  
+-------+-------+  
1 row in set (0.00 sec)  

6. can't get the correct valuing when using JDBC to query:
jdbc url : jdbc:mysql://192.168.5.74/testdb?characterEncoding=UTF-8

*set names utf8;  
select vname,hex(vname),length(vname),char_length(vname) from tb1;  
select cname,hex(cname),length(cname),char_length(cname) from tb1;  
select vname,cname  
    ,CONVERT(CONVERT(CONVERT(vname USING latin1) USING binary) USING utf8) as c1  
    ,CONVERT(CONVERT(CONVERT(cname USING latin1) USING binary) USING utf8) as c2  
from tb1;*  

vname           hex(vname)     length(vname) char_length(vname)   
--------------- -------------- ------------- ------------------   
??×?v           C3FBD7D676     5             5                    

cname           hex(cname)     length(cname) char_length(cname)   
--------------- -------------- ------------- ------------------   
??×?c           C3FBD7D663     5             5                    

vname             cname     c1      c2      
----------------- --------- ------- ------  
??×?v             ??×?c                 
Geln Yang
  • 902
  • 2
  • 20
  • 36

3 Answers3

4

I resolve it. The multi-bytes characters are converted to bytes when storing in latin1 field. It need to convert back to the inserting charset. The following sql can do this:

CONVERT(CONVERT(CONVERT(vname USING latin1) USING binary) USING [INSERT_CHARSET]) 

My inserting charset is gb2312, so the sql should be :

CONVERT(CONVERT(CONVERT(vname USING latin1) USING binary) USING gb2312) 
Geln Yang
  • 902
  • 2
  • 20
  • 36
2

You should change latin1_general_ci; to utf8_general_ci or utf16_general_ci. The problem is CHARSET=latin1 cannot store unicode characters like Chinese characters.

Check out this: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-utf8.html

The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

Basic Latin letters, digits, and punctuation signs use one byte.

Most European and Middle East script letters fit into a two-byte sequence: 

extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.

**Korean, Chinese, and Japanese ideographs use three-byte sequences.**

Latin1 charset is only 8bit single byte while Chinese script require multi-bytes

hungneox
  • 9,333
  • 12
  • 49
  • 66
  • Why does it can get correct value when querying from command console? AS I know, latin1 can store any kind of data include binary . If the character can't be changed , is there any other way? – Geln Yang Dec 23 '11 at 10:48
  • @GelnYang please try use phpMyAdmin, command line font cannot decode unicode characters – hungneox Dec 23 '11 at 11:16
  • @GelnYang latin1 dose not support character set for Chinese character :( http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-utf8.html check out this – hungneox Dec 23 '11 at 11:19
  • changing to utf8_general_ci must be the best solution. But we can't change customer's DB. I doubt the data stored being corrupted, but it can get the correct value to query using MYSQL ODBC. So the data stored is just latin format and needed to convert to the right format when using JDBC. But I don't know how? – Geln Yang Dec 25 '11 at 05:08
  • @GelnYang how you display your data? JLabel? – hungneox Dec 25 '11 at 09:58
  • just print out the data into system console. I find the HEX value is equal to the URL-Encoded value(removing char %), maybe the ODBC does some Decoding things. But there isn't system function found to do this. – Geln Yang Dec 25 '11 at 15:55
  • be aware that system console font might not displays correctly chinese. try to write it to a file. – hungneox Dec 25 '11 at 15:58
0

Another, simpler way of getting the desired characters would be to get the bytes from the column and then convert those into a string inside the application.

Pseudocode...

byte[] rawBytes = resultSet.getBytes( "vname" );
String vname = new String( rawBytes, "gb2312" );

A very useful way to check and see if mysql is showing you the correct string in the monitor by accident or design is to use the length functions.

select vname, length( vname ) as bytelength, char_length( vname ) as stringlength from tb1;

In addition, the hex() function is also your friend:

select vname, hex( vname ) from tb1;

That'll show you the hex values of the bytes that are stored in the column so you can then look those up (yay, Wikipedia!) to see if they match the displayed value in the various different ways of viewing it.

chooban
  • 9,018
  • 2
  • 20
  • 36