3

I am using MySQL 5.6.27, Java Connector 5.1.36 on Linux, and I have problem with some of the Serbian/Croatian/Slovenian characters.

Database is started with

./bin/mysqld_safe --user=mysql --bind_address=localhost --character-set-server=utf8 &

Database is created with

-- CREATE USER 'my_test'@'localhost' IDENTIFIED BY 'my_test';
-- CREATE DATABASE my_test DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- GRANT ALL PRIVILEGES ON my_test.* TO 'my_test'@'localhost' IDENTIFIED BY 'my_test';
USE my_test;

CREATE TABLE proba
(
    content TEXT NOT NULL
) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';

INSERT INTO proba(content)
VALUES ('markovič marko SURČIN');

INSERT INTO proba(content)
VALUES ('Nikolić Nikola Ćićevac');   

INSERT INTO proba(content)
VALUES ('petroviš đura Đeram');

INSERT INTO proba(content)
VALUES ('Milošević Miloš Šabac');

INSERT INTO proba(content)
VALUES ('jovanović žarko Žarkovo');

This dump is imported into MySQL with

/usr/local/mysql-5.6.27/bin/mysql --user=my_test --password < schema.sql

Java client fetches data with

public class Serbian
{
    public static void main(String[] args) throws ClassNotFoundException, SQLException
    {
        Connection conn = null;
        try
        {
            Class.forName("com.mysql.jdbc.Driver");
            conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/my_test?user=my_test&password=my_test&useUnicode=true&characterEncoding=UTF-8&collation=utf8_unicode_ci");

            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT content FROM proba");
            while(rs.next())
            {
                String s = rs.getString("content");
                System.out.println(s);
            }
            rs.close();
        }
        catch (SQLException exc)
        {
            exc.printStackTrace();
        }
        finally
        {
            conn.close();
        }
    }
}

The result is

markovič marko SUR??IN
Nikoli?? Nikola ??i??evac
petroviš ??ura Đeram
Miloševi?? Miloš Šabac
jovanovi?? žarko Žarkovo

(question marks are actually invalid characters reported by editor). In other words, letters Č, Ć, ć, đ are invalid while Ž, ž, č, Š, š, Đ are valid. It's weird that UTF8 partially works. Is there something that I should give a try or this seems to be a MySQL/Java Connector problem?

karastojko
  • 1,156
  • 9
  • 14
  • This is the console output, but could you try writing this in a file instead ? – Arnaud Dec 18 '15 at 13:35
  • Actually, it's written into a file, not the console. – karastojko Dec 18 '15 at 13:37
  • Hmmm - you have **two** questionmarks where **one** character should be - this feels like edtitor tries to display ISO-* code. Can you give hex decode of that file or use different editor maybe (whare do you use now?) – Jan Dec 18 '15 at 13:44
  • Make sure that the encoding is being done properly by the database, the file, and the console. Right now I don't believe you know where it's being lost. – duffymo Dec 18 '15 at 13:48
  • 1
    @Jan: Here is the hex dump: `6d 61 72 6b 6f 76 69 c4 8d 20 6d 61 72 6b 6f 20 53 55 52 c4 3f 49 4e 0a 4e 69 6b 6f 6c 69 c4 3f 20 4e 69 6b 6f 6c 61 20 c4 3f 69 c4 3f 65 76 61 63 0a 70 65 74 72 6f 76 69 c5 a1 20 c4 3f 75 72 61 20 c4 90 65 72 61 6d 0a 4d 69 6c 6f c5 a1 65 76 69 c4 3f 20 4d 69 6c 6f c5 a1 20 c5 a0 61 62 61 63 0a 6a 6f 76 61 6e 6f 76 69 c4 3f 20 c5 be 61 72 6b 6f 20 c5 bd 61 72 6b 6f 76 6f 0a` @duffymo: I believe that I've set everything (see the post), the weird part is that it works partially. If it wasn't set, I guess none of the non-latin characters would work. – karastojko Dec 18 '15 at 15:06
  • Have you tried the suggestions from the similar question [here](http://stackoverflow.com/q/13234433/2144390)? – Gord Thompson Dec 18 '15 at 15:20
  • @GordThompson Yes, then I get no Slavic characters at all, just plain question marks: `markovi? marko SUR?IN Nikoli? Nikola ?i?evac petrovi? ?ura ?eram Milo?evi? Milo? ?abac jovanovi? ?arko ?arkovo` – karastojko Dec 18 '15 at 15:27
  • 2
    Your letter Č shows up as c4 3f which is 쐿 in utf-8. There has to be something broken *before* you retrieve that string. How did you get the bytes? – Jan Dec 18 '15 at 15:28
  • I am executing the Java code given in the post with `java -cp ".:./mysql-connector-java-5.1.36-bin.jar" Serbian > java.log` – karastojko Dec 18 '15 at 15:30
  • Which takes into account your local charset. I think it'd be best if you actually wrote the bytes you recieve from DB into file. By the way: rs.getString("content"); already does some conversion – Jan Dec 18 '15 at 15:32
  • What is the character encoding of your .java source file? – Gord Thompson Dec 18 '15 at 18:14
  • *"Actually, [the output is] written into a file, not the console."* - Is it stdout (System.out) output that is being *redirected* to a file? If so, then it may suffer the same limitations as console output that is not redirected. – Gord Thompson Dec 18 '15 at 18:22
  • This question is similar to http://stackoverflow.com/questions/31700235/java-server-pages-collation-when-inserting-into-mysql – Rick James Dec 19 '15 at 20:27
  • 1
    Looking at the middle of that hex: `53 55 52 c4 3f 49 4e` -- Assuming it is latin1, it says `SURÄ?IN`. In utf8 the hex should be `53 55 52 C48C 49 4E`. Because of the `?`, I suspect you have made attempts to fix it. But in reality just made things worse. – Rick James Dec 19 '15 at 20:35
  • The database creation script needed `SET NAMES utf8;`. – Rick James Dec 19 '15 at 20:40
  • As described in the answer to myself below, `SET NAMES utf8` does not make any change. Importing schema to MySQL required the switch `--default-character-set=utf8`, otherwise some characters are corrupted. – karastojko Dec 19 '15 at 20:54

2 Answers2

1

Comment field too short. Not yet a "final" answer.

Where your letter Č should be hexdump shows up as c4 3f which is not valid utf-8.

Whereas

for(byte b : "Č".getBytes("UTF-8")) {
   System.out.println("-> " + Integer.toHexString(b));
}

Results in

-> ffffffc4
-> ffffff8c

And \uc48c is correctly printed as Č again.

You could check if moving from getString() to getNString() changes result.

Wait a second

There's 'c4 3f' everywhere some of your "missing" characters should be. The hex is already broken.

Jan
  • 13,738
  • 3
  • 30
  • 55
  • I tried also `getNString()` and works in same way. If hex is already broken, then it's a problem at Connector or MySQL side? – karastojko Dec 18 '15 at 15:57
  • It might still be the output into file / system.out try writing bytes of string you read from db – Jan Dec 18 '15 at 15:58
  • In mysql you can see names as they're supposed to be? – Jan Dec 18 '15 at 16:00
  • In MySQL I don't see any of those characters, but that's okay because I did not set console charset. But I do have the same problem with the webpage. The given code is result of the investigation while working on Servlet/JSP/Tomcat application. I solved it by keeping Base64 encoded characters in MySQL. But I am curious about the root cause of the problem. – karastojko Dec 18 '15 at 16:07
  • 1
    Well I suppose you'll have to do that onion-wise: layer by layer. Check bytes in mysql. See what bytes are received in getString () and so on. Have you tried inserting the strings from java? – Jan Dec 18 '15 at 16:10
  • 2
    I looks like you might be mixing up UTF-8 encodings and Unicode code points. `C4 3F` is actually not a valid UTF-8 encoding because the second byte starts with the bits `00`, not `10`. And `Č` is indeed UTF-8 encoded as `C4 8C` but that's not the same as the Unicode code point `\uC48C` which is `쒌` and UTF-8 encoded as `EC 92 8C`. – Gord Thompson Dec 18 '15 at 18:36
  • That totally possible @GordThompson. However it doesn't much affect the problem. The bytes received from the db are no longer what they were before and hace c4 3f instead. – Jan Dec 18 '15 at 18:53
  • Agreed. I just thought you might want to tweak your answer to try and avoid any (further) confusion for others who might be trying to make sense of all this. :) – Gord Thompson Dec 18 '15 at 19:18
  • @GordThompson which I did :-) Thanks. – Jan Dec 18 '15 at 19:24
0

Here is the solution. The SQL file remains same, although adding

SET NAMES 'utf8' COLLATE 'utf8_general_ci';

doesn't hurt. Importing dump has to be changed by adding the switch --default-character-set=utf8:

/usr/local/mysql-5.6.27/bin/mysql --user=my_test --password --default-character-set=utf8 < schema.sql

Without this parameter, hex string in the query (SELECT hex(content) FROM proba;) shows corrupted hex string with the non-existing UTF8 characters. Finally, Java code has to deal with UTF8, so in the while loop the following line fetches the UTF8 string:

System.out.println(new String(rs.getString("content").getBytes(Charset.forName("UTF-8"))));

Execution of the program and redirecting to file (java -cp ".:./mysql-connector-java-5.1.36-bin.jar" Serbian > java.log) does not make trouble, so everything looks fine:

markovič marko SURČIN
Nikolić Nikola Ćićevac
petroviš đura Đeram
Milošević Miloš Šabac
jovanović žarko Žarkovo

@Jan, @GordThompson: thanks for help in better understanding the problem.

karastojko
  • 1,156
  • 9
  • 14