0

I have a database with utf8mb4 character set and Ive been trying to select records where data are in Japanese characters but the utf8mb4_general_ci collation can't seem to order the result correctly. for example , the below characters are showing in desc order:

小松

児玉

when it should be:

児玉

小松

Fancy
  • 135
  • 12
  • MySQL 5.5 is pretty old. All kinds of support have ended nearly three years ago. Have you checked whether updating to a supported version of MySQL resolves the problem? – Nico Haase Oct 18 '21 at 09:26
  • Unfortunately, that's our restriction , we are not allowed to upgrade it. – Fancy Oct 18 '21 at 09:34
  • Do you know what is with utf8mb4_general_ci that makes it unable to sort Japanese characters? or any articles which talks about this? – Fancy Oct 18 '21 at 09:35
  • 1
    @Fancy did you check: https://dev.mysql.com/doc/refman/8.0/en/faqs-cjk.html ? – Ergest Basha Oct 18 '21 at 09:48
  • Thanks @ErgestBasha yes i found the same FAQs for 5.5 version, very helpful – Fancy Oct 18 '21 at 10:01

1 Answers1

1

Okay , For others who might stamble upon this same situation . Here is the solution i found. So , i read this article

MySQL 5.5 FAQ: MySQL Chinese, Japanese, and Korean Character Sets

I'll post here snippet of what it says:

    mysql> SELECT * FROM tj WHERE s1 = 'か';
    +------+
    | s1   |
    +------+
    | が  |
    | か  |
    +------+
2 rows in set (0.00 sec)

The character in the first result row is not the one that we searched for. Why did MySQL retrieve it? First we look for the Unicode code point value, which is possible by reading the hexadecimal number for the ucs2 version of the characters:

mysql> SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;
+------+-----------------------------+
| s1   | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が  | 304C                        |
| か  | 304B                        |
+------+-----------------------------+
2 rows in set (0.03 sec)

Now we search for 304B and 304C in the 4.0.0 allkeys table, and find these lines:

304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA

304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA;

QQCM The official Unicode names (following the “#” mark) tell us the Japanese syllabary (Hiragana), the informal classification (letter, digit, or punctuation mark), and the Western identifier (KA or GA, which happen to be voiced and unvoiced components of the same letter pair). More importantly, the primary weight (the first hexadecimal number inside the square brackets) is 1E57 on both lines. For comparisons in both searching and sorting, MySQL pays attention to the primary weight only, ignoring all the other numbers. This means that we are sorting が and か correctly according to the Unicode specification. If we wanted to distinguish them, we'd have to use a non-UCA (Unicode Collation Algorithm) collation (utf8_bin or utf8_general_ci), or to compare the HEX() values, or use ORDER BY CONVERT(s1 USING sjis). Being correct “according to Unicode” isn't enough, of course: the person who submitted the bug was equally correct. We plan to add another collation for Japanese according to the JIS X 4061 standard, in which voiced/unvoiced letter pairs like KA/GA are distinguishable for ordering purposes.

As what the article suggested, I used " ORDER BY CONVERT(s1 USING sjis)" as a workaround to resolve my problem.

Fancy
  • 135
  • 12