3

I am parsing a bunch of XML files and inserting the value obtained from them into a MySQL database. The character set of the mysql tables is set to utf8. I'm connecting to the database using the following connection url - jdbc:mysql://localhost:3306/articles_data?useUnicode=false&characterEncoding=utf8

Most of the string values with unicode characters are entered fine (like Greek letters etc.), except for some that have a math symbol. An example in particular - when I try to insert a string with mathematical script capital g (img at www.ncbi.nlm.nih.gov/corehtml/pmc/pmcents/1D4A2.gif) ( http://graphemica.com/ ) (Trying to parse and insert this article), I get the following exception -

java.sql.SQLException: Incorrect string value: '\xF0\x9D\x92\xA2 i...' for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3515)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3447)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1951)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2101)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2554)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1761)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2046)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1964)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1949)

If I change my connection URL to - jdbc:mysql://localhost:3306/articles_data, then the insert works, but all regular UTF8 characters are replaced with a question mark.

There are two possible ways I'm trying to fix it, and haven't succeeded at either yet -

  1. When parsing the article, maintain the encoding. I'm using org.apache.xerces.parsers.DOMParser to parse the xml files, but can't figure out how to prevent it from decoding (relevant XML - <p>&#x1d4a2; is a set containing...</p>). I could re-encode it, but that just seems inefficient.

  2. Insert the math symbols into the database.

Mathias Bynens
  • 144,855
  • 52
  • 216
  • 248
Shashank Agarwal
  • 2,769
  • 1
  • 22
  • 24
  • Please check the encoding of your xml-files. Is that UTF-8 as well? And on which operating system is your app running? – powerMicha Jul 14 '11 at 15:15
  • Why `useUnicode=false` in url? – Alex Gitelman Jul 14 '11 at 15:26
  • @Alex: I tried `useUnicode=true` first, didn't make a difference. I turned it to false expecting that it wouldn't force trying to insert it as utf8 characters and would fallback to question marks where it didn't like it. @powerMicha: I'm testing it on Windows right now. The XML header doesn't give any information about the encoding of the file, but when I parse it, it converts the encoded characters to (I'm assuming) UTF-8. I'm assuming that because if I stdout the parsed string and look at it in NP++ with UTF-8 encoding, it appears correctly. – Shashank Agarwal Jul 14 '11 at 15:28
  • Looks like '\xF0\x9D\x92\xA2' is the correct utf-8 encoding of \U01d4a2, so the problem seems to be on the mysql side. Which version of the DB and jdbc driver are your using? – Jörn Horstmann Jul 14 '11 at 15:30
  • @Jörn: I'm on Server 5.5.11. I'm using `mysql-connector-java-5.1.7-bin.jar` to connect. – Shashank Agarwal Jul 14 '11 at 15:36

1 Answers1

6

MySQL up to version 5.1 seems to only support unicode characters in the basic multilingual plane, which when encoded as utf-8 take no more than 3 bytes. From the manual on unicode support in version 5.1:

MySQL 5.1 supports two character sets for storing Unicode data:

  • ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character
  • utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character

In version 5.5 some new character sets where added:

...

  • utf8mb4, a UTF-8 encoding of the Unicode character set using one to four bytes per character

ucs2 and utf8 support BMP characters. utf8mb4, utf16, and utf32 support BMP and supplementary characters.

So if you are on mysql 5.1 you would first have to upgrade. In later versions you have to change the charset to utf8mb4 to work with these supplementary characters.

It seems the jdbc connector also requires some further configuration (From Connector/J Notes and Tips):

To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.

Jörn Horstmann
  • 33,639
  • 11
  • 75
  • 118
  • The server is a 5.5! I changed the encoding of the table to utfmb4, but still got the error. I switched connection url to `jdbc:mysql://localhost:3306/articles_data?useUnicode=false&characterEncoding=utf8mb4`, but that gives me an `Unsupported character encoding 'utf8mb4'.` error. – Shashank Agarwal Jul 14 '11 at 16:10
  • 1
    @Shashank Agarwal, This tip for the jdbc connector might also be relavant: http://download.oracle.com/docs/cd/E17952_01/refman-5.1-en/connector-j-usagenotes-troubleshooting.html#qandaitem-21-3-5-3-1-15 – Jörn Horstmann Jul 14 '11 at 16:24
  • That tip fixes it! Thanks :). – Shashank Agarwal Jul 14 '11 at 17:09
  • @ShashankAgarwal how did you fix the `Unsupported character encoding 'utf8mb4'` issue? – Luke Sep 14 '13 at 21:40
  • @Luke, it was a while back so I don't remember exactly what worked, but after configuring `character_set_server=utf8mb4`, don't set the `characterEncoding=utf8` parameter in your connection string; e.g. in my case the connection string would become `jdbc:mysql://localhost:3306/articles_data`. – Shashank Agarwal Sep 16 '13 at 13:48
  • I second this answer. The `character_set_server` property has to be set in the `/etc/mysql/my.cnf`. Restart the server and everything works fine. Nothing else has to be set anywhere else (no occurence of any character attribute in the `hibernate.cfg` or in any sql statement to create the database or the tables). – Torsten May 25 '14 at 08:55
  • @JörnHorstmann, could you please add the tip here. the link is not working? I've been struggling with this issues for two days ! – fattah.safa Oct 02 '16 at 04:00
  • 1
    @fattah.safa I fixed the link (again), the solution from that link was already quoted above – Jörn Horstmann Oct 02 '16 at 10:16