2

I will try to explain my problem. I am working with Czech special letters in MySQL database, such as ěščřžýáíé, etc. I was having problem when searching in database, because I was searching for exactly the word, which I knew was in the database, but no results were found.

Later I discovered, that somehow I have two "versions" of characters, which look exactly the same, but they are different in Unicode. Please take a look at example here. Both "" and "á" look the same and should be treated as the same. So now I need a way how to convert one to other. Preferably using MySQL or PHP.

enter image description here

As you can see on above picture, the first character is actually made of two characters "a" and "´". The second characters is just "á" as single character and that is what I would like to convert the first one to.

Any ideas how to handle this ? Of course, I am talking about all the Czech characters, not only "á", I used it only for example.

Frodik
  • 14,986
  • 23
  • 90
  • 141
  • I think `replace(letters, 'á', 'á')` would work. http://sqlfiddle.com/#!9/0f66ba/4 or do you have entities in the DB? – chris85 Jun 18 '16 at 14:56
  • @chris85 I was kind of looking for some more "universal" solution. But I don't know if it even exists ? – Frodik Jun 18 '16 at 17:25
  • Oh, so not specific to `á` but all combined characters..yea not sure about that one. That'd be a bit more difficult. I'd work on the PHP side for that. – chris85 Jun 18 '16 at 17:32
  • Yes, exactly. Any ideas how to do it ? – Frodik Jun 18 '16 at 19:05

1 Answers1

3

What you're describing is Unicode canonical equivalence.

I figured using utf8mb4_unicode_ci collation would solve this for you. However the documentation implied that it would not:

A combined character will be considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by the CHAR_LENGTH() function or in result set metadata).

However a quick test seems to indicate that is incorrect:

mysql -u root -e "SELECT 'a<0301>' = 'á' COLLATE utf8mb4_unicode_ci;"
+-----------------------------------------+
| 'á' = 'á' COLLATE utf8mb4_unicode_ci    |
+-----------------------------------------+
|                                       1 |
+-----------------------------------------+

Confusing.. though I wonder if that sentence is only applies in the context of the previous two sentences:

Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo.

So anyway, that may work for you. It is worth noting that utf8mb4_unicode_ci will result in relatively loose matching, e.g. á and a will be treated equivalent:

mysql -u root -e "SELECT 'á' = 'a' COLLATE utf8mb4_unicode_ci;"
+---------------------------------------+
| 'á' = 'a' COLLATE utf8mb4_unicode_ci  |
+---------------------------------------+
|                                     1 |
+---------------------------------------+

Another option, should you wish to have finer control on this, is to normalize text before insert into your database (intl extention required). Whether or not you'll want to do this depends on how interested you are in keeping it in it's absolute original form. The normalization process guarantees visual equivalence, so it should be safe to apply. For example, if you were to normalize to the composed form (which would be most storage efficient, should you care):

<?php

$a = 'á'; // 0xC3 0xA1
$b = 'á'; // 0x61 0xCC 0x81

$ca = \Normalizer::normalize($a, \Normalizer::FORM_C);
$cb = \Normalizer::normalize($b, \Normalizer::FORM_C);

$da = \Normalizer::normalize($a, \Normalizer::FORM_D);
$db = \Normalizer::normalize($b, \Normalizer::FORM_D);

var_dump($a === $b); // FALSE
var_dump($a === $cb); // TRUE, $a is already composed
var_dump($ca === $cb); // TRUE, $a is unchanged by normalizer
var_dump($b === $da); // TRUE, $b is already decomposed
var_dump($db === $da); // TRUE, $b is unchanged by normalizer
nj_
  • 2,219
  • 1
  • 10
  • 12
  • Thanks for great answer. However, I wanted to try simple SELECT with `COLLATE utf8_unicode_ci` but I got this error: `COLLATION 'utf8_unicode_ci' is not valid for CHARACTER SET 'utf8mb4'` which I have no idea what it means :-) Can you help ? – Frodik Jun 19 '16 at 13:59
  • Oh, yes, use `utf8mb4_unicode_ci` then. This is because you are connecting using character set = `utf8mb4`. I'll edit my answer to use `utf8mb4_unicode_ci` actually, `utf8mb4` should generally be used over `utf8`. By the way, should you wish to use collations to solve this, you can set it at a [column level](http://dev.mysql.com/doc/refman/5.7/en/charset-column.html), which would mean manually specifying it would be unnecessary. – nj_ Jun 19 '16 at 22:31
  • Thak you for explanation. However I have another strange problem with your answer. When I run `SELECT 'a' = 'á' COLLATE utf8mb4_unicode_ci` it returns `1` as your example. But when I run `SELECT * FROM table WHERE name LIKE '%maslo%' COLLATE utf8mb4_unicode_ci` it doesn't return any of the rows which contain the "double á" character, only the ones which contain "single á" character. Could you please advice any further ? – Frodik Jun 20 '16 at 04:56
  • 1
    I think you're suffering from [the issue described here](http://stackoverflow.com/questions/7917367/mysql-why-are-collation-rules-ignored-by-like-operator-for-german-%C3%9F-character/7918413). TLDR; `LIKE` behaves differently. I tested things locally and got the same result as you. However, using the normal `=` comparison operator, things worked as expected. If you want `LIKE` to work, normalizing may be your only option. – nj_ Jun 20 '16 at 14:32