1

Disclaimer:

  1. Database is ut8mb4_unicode_520_ci
  2. Table field is ut8mb4_unicode_520_ci

How do you correctly query a table field that contains dakuten or handakuten Japanese characters? Dakuten.

Currently, it seems that the base character is returned, even when the query is ran for the tenten version.

Example Data

Given and . And a row with ID: 199, post_title: 'へ';

Scenario 1

Run:

SELECT 'へ' = 'ぺ'; 

-- Returns 0. Correct

Scenario 2

Run:

SELECT ID, post_title 
FROM wp_posts 
WHERE post_title = 'へ';

-- Returns row 199. Correct

Scenario 3

But, for some reason, when I run this query, it still returns record 199, noting the different title value.

Run:

SELECT ID, post_title 
FROM wp_posts 
WHERE post_title = 'ぺ';

-- Returns row 199. Incorrect

Example Image

An image would explain better (I'm just using union to better diplay everything in one screenshot):

enter image description here

Is there a solid approach to working with these characters? All other Japanese characters seem to work fine, its just the dakuten versions are treated like their bases in queries only.

Sebastian Brosch
  • 42,106
  • 15
  • 72
  • 87
Chris
  • 54,599
  • 30
  • 149
  • 186
  • are you using proper collation? – Rahul Dec 12 '16 at 18:00
  • I'm reading in to it now. It seems utf general is case sensitive, and utf unicode is case insensitive for Japanese characters. Running tests now to see if swapping works – Chris Dec 12 '16 at 18:05
  • There is something funky with certain unicode collations I think: https://bugs.mysql.com/bug.php?id=79977 – Chris Dec 13 '16 at 04:49
  • `HEX('へぺ')` is `E381B8 E381BA`; is that what you have? – Rick James Dec 14 '16 at 00:44
  • I can confirm swapping it from `520` to `utf8_general_ci` restores the expected behaviour. But I don't feel like that is a great solution.. – Chris Dec 14 '16 at 00:47
  • general_ci is much older; even 520 is not the latest Unicode standard. – Rick James Feb 11 '23 at 19:43

2 Answers2

3

This is because the collation you used (utf8mb4_unicode_ci, utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci) only compares character's base letter. For example, 'ぺ' = 'へ' + U+309A ◌゚, 'へ' is the base letter of 'ぺ'. So for your case, all 3 characters' base letter is same, 'へ'. So it is correct result for those collations return '1'.

MySQL team is developing a new Japanese collation for utf8mb4 character set. It will differentiate these dakuten characters from base character. It will come soon.

Xing Zhang
  • 71
  • 1
  • Thanks @Xing Zhang. Is there any documentation on that? Or reference material? – Chris Dec 14 '16 at 09:29
  • Thanks. I assume "come soon" means 8.0.x and not 5.7.xx? – Rick James Dec 14 '16 at 19:48
  • @Chris. Please refer to A.11.13 section in [link](http://dev.mysql.com/doc/refman/5.7/en/faqs-cjk.html). It explains how these collations compare Japanese character KA and GA. I think it is same as your question. – Xing Zhang Dec 15 '16 at 02:44
  • @RickJames. I suppose it is 8.0.1+. – Xing Zhang Dec 15 '16 at 02:45
  • @XingZhang A.11.13 is exactly the answer. I wasn't aware of the termprimary weight in mysql. So it seems like I have two options - revert to _bin or _general (a non-UCA) OR wrap queries in hex - like: `SELECT HEX('へ') = HEX('ぺ') COLLATE utf8mb4_unicode_520_ci;` – Chris Dec 15 '16 at 03:10
  • @XingZhang Btw - great read up. The entire doc you linked is really interesting for the CJK in general – Chris Dec 15 '16 at 03:11
  • If you are going to compare HEX to HEX, you probably do not need to specify a collation; if you do, it may as well be `ascii_ci`. – Rick James Dec 15 '16 at 04:44
1
SELECT 'へ' = 'ぺ' COLLATE utf8mb4_unicode_ci; --> 0  (ditto for general_ci)
SELECT 'へ' = 'ぺ' COLLATE utf8mb4_unicode_520_ci; --> 1

The latter is a newer Unicode standard, so it is, in theory, more correct.

But what are you really doing? Probably comparing one column to another? Are they both utf8mb4_unicode_520_ci? (The database and the connection don't matter.)

Or is one side of = a column and the other is a literal?

Do you establish the collation when connecting?

Addenda

In version 8.0.0, all of these give 1:

utf8mb4_unicode_ci  -- a change from 0 in 5.6.12, but 1 in 5.7.15?
utf8mb4_unicode_520_ci
utf8mb4_0900_ai_ci
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 1
    I see nothing in the changelogs about this issue. If you want more satisfaction, go to http://bugs.mysql.com . – Rick James Dec 14 '16 at 01:29
  • Yes I am seeing all 3 return `1`. Thank you for taking the time to answer. To answer some of your questions - the context is a `WHERE` clause, column against a string literal. Yes I am setting the collation on upon connecting. In terms of the 'more satisfaction' - do you mean the question isn't really appropriate for SO? And should be moved to bugs.mysql.com ? In otherwords, are you leaning towards this being a bug (I'm always hesitant to create noise on bug trackers unless its really a bug). Thanks again – Chris Dec 14 '16 at 02:36
  • Second clarification: I'm on mysql `v.5.7.16`. – Chris Dec 14 '16 at 02:48
  • 1
    If MySQL is correctly following the Unicode specifications (which I cannot verify), then the bug might be against Unicode (which I have not studied) instead of MySQL. – Rick James Dec 14 '16 at 03:06
  • Even newer: `SELECT 'へ' = 'ぺ' COLLATE utf8mb4_0900_ai_ci;` --> 1. – Rick James Feb 04 '23 at 19:59
  • Some limited experiments show that `LIKE` matches but `REGEXP` does not. – Rick James Feb 04 '23 at 20:00