What you're describing is Unicode canonical equivalence.
I figured that using the utf8mb4_unicode_ci collation would solve this for you. However, the documentation implies that it would not:
A combined character will be considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by the CHAR_LENGTH() function or in result set metadata).
However, a quick test suggests that this is not the case:
mysql -u root -e "SELECT 'a<0301>' = 'á' COLLATE utf8mb4_unicode_ci;"
+-----------------------------------------+
| 'á' = 'á' COLLATE utf8mb4_unicode_ci |
+-----------------------------------------+
| 1 |
+-----------------------------------------+
Confusing. I wonder, though, whether that sentence applies only in the context of the two sentences that precede it in the documentation:
Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo.
So anyway, that may work for you. It is worth noting that utf8mb4_unicode_ci results in relatively loose matching, e.g. á and a are treated as equivalent:
mysql -u root -e "SELECT 'á' = 'a' COLLATE utf8mb4_unicode_ci;"
+---------------------------------------+
| 'á' = 'a' COLLATE utf8mb4_unicode_ci |
+---------------------------------------+
| 1 |
+---------------------------------------+
Another option, should you wish to have finer control over this, is to normalize text before inserting it into your database (intl extension required). Whether or not you'll want to do this depends on how interested you are in keeping the text in its absolute original form. Normalizing to the composed or decomposed form preserves canonical equivalence, so the text should look and behave the same afterwards and it should be safe to apply. For example, if you were to normalize to the composed form (which would be the most storage efficient, should you care):
<?php
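// The \u{...} escapes (PHP 7+) keep the two forms distinct even if an editor or copy/paste re-normalizes the text
$a = "\u{00E1}"; // á as one precomposed code point (U+00E1): 0xC3 0xA1
$b = "a\u{0301}"; // a followed by a combining acute accent (U+0061 U+0301): 0x61 0xCC 0x81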
$ca = \Normalizer::normalize($a, \Normalizer::FORM_C);
$cb = \Normalizer::normalize($b, \Normalizer::FORM_C);
$da = \Normalizer::normalize($a, \Normalizer::FORM_D);
$db = \Normalizer::normalize($b, \Normalizer::FORM_D);
var_dump($a === $b); // FALSE
var_dump($a === $cb); // TRUE, $a is already composed
var_dump($ca === $cb); // TRUE, $a is unchanged by normalizer
var_dump($b === $da); // TRUE, $b is already decomposed
var_dump($db === $da); // TRUE, $b is unchanged by normalizer
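To apply this before an insert, the normalization step could be wrapped in a small helper. The following is only a sketch: the function name normalizeForStorage is hypothetical (it is not part of intl or of the question's code), and it assumes PHP 7+ with the intl extension loaded and UTF-8 input.

```php
<?php
// Hypothetical helper, not from the original answer: returns the NFC (composed)
// form of a UTF-8 string so canonically equivalent inputs are stored identically.
function normalizeForStorage(string $text): string
{
    // Nothing to do when the string is already in NFC.
    if (\Normalizer::isNormalized($text, \Normalizer::FORM_C)) {
        return $text;
    }

    // Normalizer::normalize() returns false when the input is not valid UTF-8.
    $normalized = \Normalizer::normalize($text, \Normalizer::FORM_C);
    if ($normalized === false) {
        throw new \InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $normalized;
}

$composed   = "\u{00E1}";  // U+00E1, bytes 0xC3 0xA1
$decomposed = "a\u{0301}"; // U+0061 U+0301, bytes 0x61 0xCC 0x81

// Both inputs end up as the same byte sequence after normalization.
var_dump(normalizeForStorage($composed) === normalizeForStorage($decomposed)); // bool(true)
```

The returned value is what you would bind to your INSERT or UPDATE statement. If you go this route, you would probably also want to normalize search input the same way, so that lookups match regardless of which form the user typed.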