I am parsing a bunch of XML files and inserting the values obtained from them into a MySQL database. The character set of the MySQL tables is set to utf8. I'm connecting to the database using the following connection URL: jdbc:mysql://localhost:3306/articles_data?useUnicode=false&characterEncoding=utf8
Most of the string values with Unicode characters are inserted fine (Greek letters, etc.), except for some that contain a math symbol. One example in particular: when I try to insert a string from an article containing the mathematical script capital G (𝒢, U+1D4A2; image at www.ncbi.nlm.nih.gov/corehtml/pmc/pmcents/1D4A2.gif, see also http://graphemica.com/ ), I get the following exception:
java.sql.SQLException: Incorrect string value: '\xF0\x9D\x92\xA2 i...' for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3515)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3447)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1951)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2101)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2554)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1761)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2046)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1964)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1949)
If I change my connection URL to jdbc:mysql://localhost:3306/articles_data, the insert works, but all the regular UTF-8 characters are replaced with question marks.
There are two ways I'm trying to fix this, and I haven't succeeded at either yet:
1. When parsing the article, maintain the encoding. I'm using org.apache.xerces.parsers.DOMParser to parse the XML files, but can't figure out how to prevent it from decoding (relevant XML: <p>𝒢 is a set containing...</p>). I could re-encode it, but that just seems inefficient.
2. Insert the math symbols into the database.
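To make the problem easier to reproduce, here is a minimal sketch of how the affected strings can be detected and, as a fallback, re-encoded. The bytes in the exception (\xF0\x9D\x92\xA2) are the 4-byte UTF-8 encoding of U+1D4A2, which lies outside the Basic Multilingual Plane; I suspect that is what separates the failing symbols from the Greek letters that insert fine. The class and method names below are hypothetical, made up for illustration, and escaping to a numeric character reference is only one possible way to "re-encode", not necessarily the right fix.

```java
import java.nio.charset.StandardCharsets;

class SupplementaryCheck {

    // Hex dump of a string's UTF-8 bytes, in the same \xNN style MySQL uses in its error message.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the string contains any code point above U+FFFF (encoded as 4 bytes in UTF-8).
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // Replaces each supplementary code point with a numeric character reference such as &#x1D4A2;.
    // All other characters are copied through unchanged.
    static String escapeSupplementary(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the symbol from its code point so this source file stays ASCII-only.
        String g = new String(Character.toChars(0x1D4A2));
        String text = g + " is a set";

        System.out.println(utf8Hex(g));                 // UTF-8 bytes of the symbol
        System.out.println(hasSupplementary(text));     // does the text need 4-byte UTF-8?
        System.out.println(escapeSupplementary(text));  // text with the symbol escaped
    }
}
```

Running hasSupplementary over the parsed text before each insert would at least show whether every failing row contains a character of this kind.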