11

In a table x, there is a column with the values u and ü.

SELECT * FROM x WHERE column='u'.

This returns u AND ü, although I am only looking for the u.

The table's collation is utf8mb4_unicode_ci . Wherever I read about similar problems, everyone suggests to use this collation because they say that utf8mb4 really covers ALL CHARACTERS. With this collation, all character set and collation problems should be solved.

I can insert ü, è, é, à, Chinese characters, etc. When I make a SELECT *, they are also retrieved and displayed correctly.

The problem only occurs when I COMPARE two strings as in above example (SELECT WHERE) or when I use a UNIQUE INDEX on the column. When I use the UNIQUE INDEX, a "ü" is not inserted when I have a "u" in the column already. So, when SQL compares u and ü in order to decide whether the ü is unique, it thinks it is the same as the u and doesn't insert the ü.

I changed everything to utf8mb4 because I don't want to worry about character sets and collation anymore. However, it seems that utf8mb4 isn't the solution either when it comes to COMPARING strings.

I also tried this: SELECT * FROM x WHERE _utf8mb4 'ü' COLLATE utf8mb4_unicode_ci = column.
This code is executable (looks pretty sophisticated). However, it also returns ü AND u.

I have talked to some people in India and here in China about this issue. We haven't found a solution yet.

If anyone could solve the mystery, it would be really great.

Add_On: After reading all the answers and comments below, here is a code sample which solves the problem:

SELECT * FROM x WHERE 'ü' COLLATE utf8mb4_bin = column

By adding "COLLATE utf8mb4_bin" to the SELECT query, SQL is invited to put the "binary glasses" (ending _bin) on when it looks at the characters in the column. With the binary glasses on, SQL sees now the binary code in the column. And the binary code is different for every letter and character and emoji which one can think of. So, SQL can now also see the difference between u and ü. Therefore, now it only returns the ü when the SELECT query looks for the ü and doesn't also return the u.

In this way, one can leave everything (database collation, table collation) the same, but only add "COLLATE utf8mb4_bin" to a query when exact differentiation is needed.

(Actually, SQL takes all other glasses off (utf8mb4_german_ci, _general_ci, _unicode_ci etc.) and only does what it does when it is not forced to do anything additional. It simply looks at the binary code and doesn't adjust its search to any special cultural background.)

Thanks everybody for the support, especially to Pred.

Jakob
  • 111
  • 1
  • 5
  • You might want to read https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html – Mark Rotteveel Dec 12 '16 at 15:26
  • From your description, it seems you actually want to *ignore* collation and perform a binary match. What are your specific rules? E.g., should `u` and `U` be considered equal or different? – Álvaro González Dec 12 '16 at 15:47
  • Hi Alvaro. I have spent a lot of time with stackoverflow in the past. However, i have never posted a question. The responses are overwhelming. Stackoverflow seems to be a road to solutions. The best would be: During comparisons (WHERE or UNIQUE) everyhing which is displayed differently in rows is also differentiated during the comparison. So, ä is NOT a. a is NOT A. è is NOT e etc. If binary differentiates all of that, that would probably be the way to go. Can I just differentiate during the query and leave everything else the same? What would be a practical way to deal with this? – Jakob Dec 13 '16 at 05:59
  • Hi Alvaro. You also ask about rules: 1. It should be a predictable solution which I UNDERSTAND completely. 2. If possible, it should be working in many different environments. 3. If possible, it should be a simple solution. 4. If possible, it should be fast (not so important right now). However, most important: It should be FULLY PREDICTABLE and easy to UNDERSTAND and DEAL with. – Jakob Dec 13 '16 at 06:04
  • Well... It's not that English speakers are the only culture in the history of civilization that figured out how to sort and look up words in a dictionary and everybody else just do it randomly... – Álvaro González Dec 13 '16 at 09:39
  • 1
    Thanks, Alvaro. After reading what Pred wrote below, I also understand your answer. Yes, at the core of it all is what a user expects because of his cultural background. Thanks. – Jakob Dec 13 '16 at 12:37
  • If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com . – Rick James Mar 14 '17 at 22:20

4 Answers4

12

Collation and character set are two different things.

Character set is just an 'unordered' list of characters and their representation. utf8mb4 is a character set and covers a lots of characters.

Collation defines the order of characters (determines the end result of order by for example) and defines other rules (such as which characters or character combinations should be treated as same). Collations are derived from character sets, there can be more than one collation for the same character set. (It is an extension to the character set - sorta)

In utf8mb4_unicode_ci all (most?) accented characters are treated as the same character, this is why you get u and ü. In short this collation is an accent insensitive collation.

This is similar to the fact that German collations treat ss and ß as same.

utf8mb4_bin is another collation and it treats all characters as different ones. You may or may not want to use it as default, this is up to you and your business rules.

You can also convert the collation in queries, but be aware, that doing so will prevent MySQL to use indexes.

Here is an example using a similar, but maybe a bit more familiar part of collations:

The ci at the end of the collations means Case Insensitive and almost all collations with ci has a pair ending with cs, meaning Case Sensitive.

When your column is case insensitive, the where condition column = 'foo' will find all of these: foo Foo fOo FoO FOo FoO fOO, FOO.

Now if you try to set the collation to case sensitive (utf8mb4_unicode_cs for example), all the above values are treated as different values.

The localized collations (like German, UK, US, Hungarian, whatever) follow the rules of the named language. In Germany ss and ß are the same and this is stated in the rules of the German language. When a German user searches for a value Straße, they will expect that a software (supporting german language or written in Germany) will return both Straße and Strasse.

To go further, when it comes to ordering, the two words are the same, they are equal, their meaning is the same so there is no particular order.

Don't forget, that the UNIQUE constraint is just a way of ordering/filtering values. So if there is a unique key defined on a column with German collation, it will not allow to insert both Straße and Strasse, since by the rules of the language, they should be treated as equal.

Now lets see our original collation: utf8mb4_unicode_ci, This is a 'universal' collation, which means, that it tries to simplify everything so since ü is not a really common character and most users have no idea how to type it in, this collation makes it equal to u. This is a simplification in order to support most of the languages, but as you already know, these kind of simplifications have some side effects. (like in ordering, filtering, using unique constraints, etc).

The utf8mb4_bin is the other end of the spectrum. This collation is designed to be as strict as it can be. To achieve this, it literally uses the character codes to distinguish characters. This means, each and every form of a character are different, this collation is implicitly case sensitive and accent sensitive.

Both of these have drawbacks: the localized and general collations are designed for one specific language or to provide a common solution. (utf8mb4_unicode_ci is the 'extension' of the old utf8_general_ci collation)

The binary requires extra caution when it comes to user interaction. Since it is CS and AS it can confuse users who are used to get the value 'Foo' when they are looking for the value 'foo'. Also as a developer, you have to be extra cautious when it comes to joins and other features. The INNER JOIN 'foo' = 'Foo' will return nothing, since 'foo' is not equal to 'Foo'.

I hope that these examples and explanation helps a bit.

Pred
  • 8,789
  • 3
  • 26
  • 46
  • Hi Pred. Thanks a lot, especially for EXPLAINING which is which. I have spent weeks of my life with character_set and collation issues and work arounds. Now, I am at the point that I want to really UNDERSTAND what is going on and find a solution which works at least in most cases while KNOWING EXACTLY in which cases it doesn't work AND UNDERSTANDING WHY. I didn't know up to now that collation also defines which characters are TREATED AS SAME DURING COMPARISONS although the collation also defines that they ARE DISPLAYED DIFFERENTLY. That kind of blows my logical mind apart. – Jakob Dec 13 '16 at 05:44
  • utf8mb4_bin WORKS. Thanks. However, I dont understand why and I don't understand the drawbacks of this solution. As I dont know all of that I dont know how to integrate this into query from php etc. So, it is time for some more reading about bin. You already helped me a lot. THANKS. – Jakob Dec 13 '16 at 05:48
  • @Jakob Updated the answer with examples and whatnot, I hope it helps :) – Pred Dec 13 '16 at 07:47
  • Thanks for this great great answer! You do not only explain but you also add examples which make it easy to put your explanations into the right context. Now, I see. At the core of all of this is the user interaction. The German user wants ß and ss to be treated the same way. And the French user might want the a and à be treated the same as he might not always be sure whether he has do add the accent or not. If a programmer doesn't use a localized collation, he must take care of all these user expectations by adjusting his code. I got it now. Finally. Wow. I UNDERSTAND. – Jakob Dec 13 '16 at 11:22
  • You are welcome and I wish happy coding. I would suggest to play with these examples and different collations first. And most importantly: Understand your business needs, that will help to choose the correct collation. E.g. if you are writing a software for German or French users only, you can use the localized collations. General one is fine for most use cases, it eliminates lots of possible misunderstanding by treating lots of characters as the same. Binary is hard to work with, it comes with lots of restrictions which may need more workarounds in the application's code. – Pred Dec 13 '16 at 12:02
  • And: If I there is NO USER INTERACTION and I want the system to do exactly what I tell it to do, I must go for a binary collation (utf8mb4_bin). Then every tiny little detail will be treated as different. Finally, I got it. THANKS! – Jakob Dec 13 '16 at 12:21
  • There is no system with no user interaction :) Yes, if you want to represent everything as is, you can use *_bin collations. Be aware, that this means, that you have to deal with case sensitivity too when it comes to joins, filtering and ordering. – Pred Dec 13 '16 at 12:23
  • For example when you want to join two tables based on a character key, the key should be the same in both tables, including accents and lower/upper case letters: `FOO != Foo`. Or if you want to do it in a case insensitive way, you'll have to transform both sides (`LOWER('FOO') = LOWER('Foo')`). The second one means, that MySQL won't use indexes, since it has to apply the transformation which renders the indexes useless. – Pred Dec 13 '16 at 12:41
  • Thanks for these examples. Examples are great. A solution would be to adjust everything to lower case before inserting it into an INDEX column and then when looking for a string, to prepare the string outside of mysql (e.g.: change everything to LOWERCASE) before looking for it in the index column. I get the whole picture now. Also the drawbacks of binary. – Jakob Dec 13 '16 at 14:18
  • I just came here TO USE CAPS for some of my words BECAUSE MAYBE MARKDOWN wasn't supported FOUR YEARS AGO, but now we can USE MARKDOWN for **emphasis**. :) – Lemmings19 Oct 03 '20 at 19:57
4

utf8_collations.html lists what letters are 'equal' in the various utf8 (or utf8mb4) collations. With rare exceptions, all accents are stripped before comparing in any ..._ci collation. Some of the exceptions are language-specific, not Unicode in general. Example: In Icelandic É > E.

..._bin is the only collation that honors the treats accented letters as different. Ditto for case folding.

If you are doing a lot of comparing, you should change the collation of the column to ..._bin. When using the COLLATE clause in WHERE, an index cannot be used.

A note on ß. ss = ß in virtually all collations. In particular, utf8_general_ci (which used to be the the default) treated them as unequal. That one collation made no effort to treat any 2-letter combination (ss) as a single 'letter'. Also, due to a mistake in 5.0, utf8_general_mysql500_ci treats them unequal.

Going forward, utf8mb4_unicode_520_ci is the best through version 5.7. For 8.0, utf8mb4_0900_ai_ci is 'better'. The "520" and "900" refer to Unicode standards, so there may be even newer ones in the future.

Rick James
  • 135,179
  • 13
  • 127
  • 222
0

You can try the utf8_bin collation and you shouldn't face this issue, but it will be case sensitive. The bin collations compare strictly, only separating the characters out according to the encoding selected, and once that's done, comparisons are done on a binary basis, much like many programming languages would compare strings.

  • Hi Veer. You write: "...only separating the characters out according to the encoding selected,..." . Can you give me an example for "encoding selected"? Do you mean uft8 or utf8mb4? – Jakob Dec 13 '16 at 12:40
  • Try using SELECT * FROM x WHERE 'ü' COLLATE utf8_bin = column. Let me know how it works out. –  Dec 13 '16 at 13:19
  • Hi Veer. Big smile now. That is exactly what I did just now. I made an add on to my question above. Yes, it works. – Jakob Dec 13 '16 at 13:32
  • Glad to know that. Keep on rockin'! –  Dec 13 '16 at 17:16
  • 1
    @jakob Another update on this from MySQL: http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf8-support-in-mysql-8-0/ –  Jan 31 '17 at 09:19
0

I'll just add to the other answers that a _bin collation has its peculiarities as well.

For example, after the following:

CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);
INSERT INTO `dummy` (`key`) VALUES ('one');

this will fail:

INSERT INTO `dummy` (`key`) VALUES ('one ');

This is described in The binary Collation Compared to _bin Collations.

Edit: I've posted a related question here.

Community
  • 1
  • 1
Sea Coast of Tibet
  • 5,055
  • 3
  • 26
  • 37