The approach shown in the question should (mostly) work just fine in MySQL for the following reasons:
Collation (not to be confused with encoding) is the set or rules that define how to sort and compare characters, typically used to replicate at database level the user expectations from a cultural perspective (if I search for cafe
I expect to find café
as well).
Collation plays an important rule on unique constraints because its establishes the definition of unique.
Binary collations are specifically meant to ignore cultural rules and work at byte level, thus utf8mb4_bin
is the right choice here.
MySQL allows to set a combination of encoding and collation with a column level granularity.
If a column definition is missing collation, it'll use the table level one.
If a table definition is missing collation, it'll use the database level one.
If a database definition is missing collation, it'll use the server level one.
It's also worth noting that MySQL will convert between encodings transparently as long as:
- Connection encoding is properly set
- Conversion is physically possible (e.g. all source characters also belong to target encoding)
For this last reason, VARBINARY
is possibly not the best choice for a column that's still text because it opens the door to getting café
stored from a connection configured to use ISO-8859-1 and not being able to retrieve it correctly from a connection configured to use UTF-8.
Side note: the table definition shown may trigger the following error:
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
Indexes may have a relatively small maximum size. From docs:
If innodb_large_prefix is enabled (the default), the index key prefix
limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED
row format. If innodb_large_prefix is disabled, the index key prefix
limit is 767 bytes for tables of any row format.
innodb_large_prefix is deprecated and will be removed in a future
release. innodb_large_prefix was introduced in MySQL 5.5 to disable
large index key prefixes for compatibility with earlier versions of
InnoDB that do not support large index key prefixes.
The index key prefix length limit is 767 bytes for InnoDB tables that
use the REDUNDANT or COMPACT row format. For example, you might hit
this limit with a column prefix index of more than 255 characters on a
TEXT or VARCHAR column, assuming a utf8mb3 character set and the
maximum of 3 bytes for each character.
Attempting to use an index key prefix length that exceeds the limit
returns an error. To avoid such errors in replication configurations,
avoid enabling innodb_large_prefix on the master if it cannot also be
enabled on slaves.
Since utf8_mb8 allocates 4 bytes per character, a 767 limit will be overflowed with only 192 characters.
We have one more problem:
mysql> CREATE TABLE `dummy` (
-> `key` varchar(191) COLLATE utf8mb4_bin NOT NULL,
-> UNIQUE KEY `key` (`key`)
-> )
-> ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one ');
ERROR 1062 (23000): Duplicate entry 'one ' for key 'key'
Pardon?
mysql> INSERT INTO `dummy` (`key`) VALUES ('One');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('öne');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM `dummy`;
+-----+
| key |
+-----+
| One |
| one |
| öne |
+-----+
3 rows in set (0.00 sec)
This last issue is a interesting subtlety of MySQL collations. From docs:
All MySQL collations are of type PADSPACE. This means that all CHAR,
VARCHAR, and TEXT values in MySQL are compared without regard to any
trailing spaces. “Comparison” in this context does not include the
LIKE pattern-matching operator, for which trailing spaces are
significant
[...]
For those cases where trailing pad characters are stripped or
comparisons ignore them, if a column has an index that requires unique
values, inserting into the column values that differ only in number of
trailing pad characters will result in a duplicate-key error.
I'd dare say then that VARBINARY
type is the only way to overcome this...