How to make MySQL handle strings like SQLite does, with regard to Unicode and collation?

Question

I've been researching this question for several hours now, on SO, in MySQL docs, and elsewhere, but still can't find a satisfactory solution. The problem is:

What is the simplest way to make MySQL treat strings just like SQLite does, without any extra "smart" conversions?

For example, the following works perfectly in SQLite:

CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);

INSERT INTO `dummy` (`key`) VALUES ('one');
INSERT INTO `dummy` (`key`) VALUES ('one ');
INSERT INTO `dummy` (`key`) VALUES ('One');
INSERT INTO `dummy` (`key`) VALUES ('öne');

SELECT * FROM `dummy`;

However, in MySQL, with the following settings:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_bin

and the following CREATE DATABASE statement:

CREATE DATABASE `dummydb` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_bin;

it still fails on the second INSERT.

I'd rather keep string column declarations as simple as possible, SQLite's TEXT being the ideal. Looks like VARBINARY is the way to go, but I would still like to hear your opinions on any other, potentially better options.

Addendum: The SHOW CREATE TABLE dummy output is

mysql> SHOW CREATE TABLE dummy;
+-------+-----------------------------------------------------
| Table | Create Table                                        
+-------+-----------------------------------------------------
| dummy | CREATE TABLE `dummy` (
  `key` varchar(255) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY `key` (`key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+-------+-----------------------------------------------------
1 row in set (0.00 sec)

Those are defaults. Do you override them in your `CREATE TABLE` statement? — Álvaro González, Feb 21 '17 at 15:38
The output ends with `ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin`. All working fine here, perhaps the key length error your side is related to some missing or extra backticks? — Sea Coast of Tibet, Feb 21 '17 at 15:49
And the key is `key varchar(255) COLLATE utf8mb4_bin NOT NULL, UNIQUE KEY key (key)` (I've stripped backticks). — Sea Coast of Tibet, Feb 21 '17 at 15:51
Please provide the output from `SHOW CREATE TABLE dummy`; we need to see the entire output; the last line is only a _default_. — Rick James, Feb 21 '17 at 17:03
Rick, I've added the output to the question. Nothing unusual there. — Sea Coast of Tibet, Feb 21 '17 at 17:11
I added onto my answer -- the accents and case are handled by Collation; the trailing spaces are handled by the datatype and a setting. — Rick James, Feb 21 '17 at 17:28
I would argue that SQLite fails to get into the spirit of UTF8, so it is the deficient one. (Granted, MySQL's complexity keeps questions coming into this forum!) — Rick James, Feb 21 '17 at 17:30

Rick James · Accepted Answer · 2017-02-21T17:24:50.350

1

MySQL wants to convert strings when doing INSERT and SELECT. The conversion is between what you declare the client to have and what the column is declared to be storing.

The only way to avoid that is with VARBINARY and BLOB instead of VARCHAR and TEXT.

The use of COLLATION utf8mb4_bin does not avoid conversion to/from CHARACTER SET utf8mb4; it merely says that WHERE and ORDER BY should compare the bits instead of dealing with accents and case folding.

Keep in mind that CHARACTER SET utf8mb4 is a way to encode text; COLLATION utf8mb4_* is rules for comparing texts in that encoding. _bin is simpleminded.

UNIQUE involves comparing for equality, hence COLLATION. In most utf8mb4 collations, the 3 (without spaces) will compare equal. utf8mb4_bin will treat the 3 as different. utf8mb4_hungarian_ci treats one=One>öne.

The trailing spaces are controlled by the datatype of the column (VARCHAR or other). The latest version even has a setting relating to whether to consider trailing spaces.

edited Feb 21 '17 at 17:24

answered Feb 21 '17 at 17:09

Rick James

135,179
13
127
222

Not sure I follow you but collation affects unique indexes as well. Since he actually has text (and not binary data as e.g. a raw hash) his approach is just fine (and it *should* work). – Álvaro González Feb 21 '17 at 17:16
Thanks for your answer. I've switched to `VARBINARY`, and have reverted the collation to `utf8mb4_unicode_ci`, for the weird to be more obvious. Yes, I intend to keep the potential implicit conversions in mind when writing queries. – Sea Coast of Tibet Feb 21 '17 at 17:20
Regarding trailing spaces, do you mean 8.0.x by "the latest version"? I've seen `PADSPACE` mentioned in the 5.7.x docs, but not how to change it for a column. – Sea Coast of Tibet Feb 21 '17 at 17:44

Álvaro González · Answer 2 · 2017-02-22T10:49:03.153

The approach shown in the question should (mostly) work just fine in MySQL for the following reasons:

Collation (not to be confused with encoding) is the set or rules that define how to sort and compare characters, typically used to replicate at database level the user expectations from a cultural perspective (if I search for cafe I expect to find café as well).
Collation plays an important rule on unique constraints because its establishes the definition of unique.
Binary collations are specifically meant to ignore cultural rules and work at byte level, thus utf8mb4_bin is the right choice here.
MySQL allows to set a combination of encoding and collation with a column level granularity.
If a column definition is missing collation, it'll use the table level one.
If a table definition is missing collation, it'll use the database level one.
If a database definition is missing collation, it'll use the server level one.

It's also worth noting that MySQL will convert between encodings transparently as long as:

Connection encoding is properly set
Conversion is physically possible (e.g. all source characters also belong to target encoding)

For this last reason, VARBINARY is possibly not the best choice for a column that's still text because it opens the door to getting café stored from a connection configured to use ISO-8859-1 and not being able to retrieve it correctly from a connection configured to use UTF-8.

Side note: the table definition shown may trigger the following error:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

Indexes may have a relatively small maximum size. From docs:

If innodb_large_prefix is enabled (the default), the index key prefix limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED row format. If innodb_large_prefix is disabled, the index key prefix limit is 767 bytes for tables of any row format.

innodb_large_prefix is deprecated and will be removed in a future release. innodb_large_prefix was introduced in MySQL 5.5 to disable large index key prefixes for compatibility with earlier versions of InnoDB that do not support large index key prefixes.

The index key prefix length limit is 767 bytes for InnoDB tables that use the REDUNDANT or COMPACT row format. For example, you might hit this limit with a column prefix index of more than 255 characters on a TEXT or VARCHAR column, assuming a utf8mb3 character set and the maximum of 3 bytes for each character.

Attempting to use an index key prefix length that exceeds the limit returns an error. To avoid such errors in replication configurations, avoid enabling innodb_large_prefix on the master if it cannot also be enabled on slaves.

Since utf8_mb8 allocates 4 bytes per character, a 767 limit will be overflowed with only 192 characters.

We have one more problem:

mysql> CREATE TABLE `dummy` (
    -> `key` varchar(191) COLLATE utf8mb4_bin NOT NULL,
    -> UNIQUE KEY `key` (`key`)
    -> )
    -> ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO `dummy` (`key`) VALUES ('one');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO `dummy` (`key`) VALUES ('one ');
ERROR 1062 (23000): Duplicate entry 'one ' for key 'key'

Pardon?

mysql> INSERT INTO `dummy` (`key`) VALUES ('One');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO `dummy` (`key`) VALUES ('öne');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM `dummy`;
+-----+
| key |
+-----+
| One |
| one |
| öne |
+-----+
3 rows in set (0.00 sec)

This last issue is a interesting subtlety of MySQL collations. From docs:

All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the LIKE pattern-matching operator, for which trailing spaces are significant

[...] For those cases where trailing pad characters are stripped or comparisons ignore them, if a column has an index that requires unique values, inserting into the column values that differ only in number of trailing pad characters will result in a duplicate-key error.

I'd dare say then that VARBINARY type is the only way to overcome this...

Thanks for your detailed answer. To clarify: I'm not getting any errors (except the PADSPACE-induced INSERT error for 'one'), so the key length issues is probably something related to your particular MySQL setup. — Sea Coast of Tibet, Feb 22 '17 at 10:39
@SeaCoastofTibet The key size issue is explained in the documentation excerpts I've shared. However, I have to confess that I've misunderstood your question all the time until right now. I was stubbornly thinking that binary collation was not taking effect at all, which is not the case. I read too fast. I'm removing the related comments to avoid spreading misinformation. — Álvaro González, Feb 22 '17 at 10:54

How to make MySQL handle strings like SQLite does, with regard to Unicode and collation?

2 Answers2

Linked