0

I have non-ASCII characters in keys in my database that are supposed to be UTF-8 (utf8mb4) encoded. And they sort of are, but not entirely:

From Java via JDBC (mysql:mysql-connector-java:8.0.21), a:

SELECT * FROM whatever WHERE id="Dümmer";

produces the right row. However, the value of the id column is returned as:

Dümmer

Here is what I know:

  • in the mysql console, the value is printed correctly.
  • the column has been created with:
    `id` varchar(511) CHARACTER SET utf8mb4 COLLATE utf8mb4_nopad_bin NOT NULL,
    
  • the com.mysql.cj.jdbc.MysqlDataSource has a null character encoding. I have tried .setCharacterEncoding( "UTF-8" ) but that does not seem to make a difference.

Where else should I look? This is Linux, Mariadb 10.5.5, running client and server on the same machine.

Johannes Ernst
  • 3,072
  • 3
  • 42
  • 56
  • 2
    In what context is it 'returned'. It looks like a bunch of UTF-8 bytes, but rendered in a place that thinks it's latin-1. How/where are you looking at the output? – Evert Apr 14 '21 at 05:16
  • I'm looking at it straight after the JDBC API call in the debugger, because my sanity check (`idInQuery.equals(idReceivedFromQuery)`) balked at me. – Johannes Ernst Apr 14 '21 at 22:11
  • Where is your debugger running? In a terminal? What encoding does it use? Possibly run `echo $LANG` – Evert Apr 15 '21 at 01:13
  • It's not a debugger problem. I have both `idInQuery` and `idReceivedFromQuery` right next to each other in various places (debugger, my app) and they differ. – Johannes Ernst Apr 15 '21 at 19:21
  • I'm not saying it's a problem, I'm saying it might use a different encoding. Even if the first one 'looks right', it could still be in in `ISO-8559-1` encoding and your input could be wrong. Understanding whether your input or output is wrong is valuable to debug this. – Evert Apr 15 '21 at 19:29
  • I'm now single-stepping through the mysql connector, and it appears that the server represents the `id` column as "binary", which then is mapped by the connector to the first encoding that it knows of, which happens to be ISO-8859-1. I can't claim I understand why it might be doing either, but decoding my UTF-8 string as ISO-8859-1 would explain what I'm seeing. (Stopping my investigation here, time for a workaround.) – Johannes Ernst Apr 15 '21 at 21:15

0 Answers0