4

I'm observing some strange MySQL behavior for VARCHAR fields that contain a single emoji. The connection, database and column are all utf8mb4 encoding and collation.

There's a table called categories with id and title fields. There's three rows, with titles of , and .

When I run a query looking for a row with the title , I get all the other rows with single emojis as well, as if MySQL is treating them as the same characters.

Query for credit card emoji returning all other single emoji rows

The expected behaviour should be that only the row with is returned.

maknz
  • 3,746
  • 3
  • 24
  • 21

2 Answers2

2

The collation determines whether two characters are considered to be equal or not.

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

If you want only an exact binary match, you might want to specify a binary collation.

Some syntax examples here for latin1.

https://dev.mysql.com/doc/refman/5.7/en/case-sensitivity.html

spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • Indeed, the problem seems to be with `utf8mb4_general_ci`, whereas `utf8mb4_unicode_ci` works with emoji as expected. – maknz Mar 09 '17 at 23:04
  • 1
    @maknz -- something wrong with your test; unicode is not adequate; you need unicode_520; see my answer. – Rick James Mar 13 '17 at 22:44
  • Using `utf8mb4_unicode_ci` fixes the problem, what does 520 do that unicode itself doesn't? – maknz Mar 14 '17 at 22:58
  • 1
    @maknz: The **`_520_`** in the collation name indicates that this is an implementation of version 5.2.0 or the Unicode Collation Algorithm (UCA). The names without `_520_` implement an earlier version (UCA 4.0.0). MySQL also has collations that are backwards compatible with collations implemented in earlier versions of MySQL – spencer7593 Mar 15 '17 at 14:58
  • Okay, I guess I'm just stumped as to why the regular `unicode_ci` fixes my example above, when it's not.. meant to? Are there any specific cases where `unicode_520_ci` will work better than `unicode_ci`, or is it just better practice? – maknz Mar 16 '17 at 02:47
  • Ah.. actually, it's probably because I'm using Percona's MySQL fork (5.6). Can't find any literature on any Unicode differences to MySQL, but that could be it. – maknz Mar 16 '17 at 03:47
  • I have not heard of the forks differing in collations. `520` distinguishes some characters that used to be treated equal. Emoji is one set of such. Among "letters", several are noted in [_here_](http://mysql.rjweb.org/utf8_collations.html), such as `ae=Æ=æ < az` instead of `ae < az < Æ=æ`. See also `Ď,Ł,Ø` – Rick James Mar 17 '17 at 05:25
  • Thanks for the clarification. I've marked your answer as correct given it's the best solution for emoji support. – maknz Mar 23 '17 at 02:40
1

You need the collation utf8mb4_unicode_520_ci; the older collations treat Emoji as equal.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • For a peek into MySQL 8.0, with Unicode 9.0, see `utf8mb4_0900_ai_ci` in [_here_](http://mysql.rjweb.org/utf8mb4_collations.html). – Rick James Mar 17 '17 at 05:27