Be aware that ord works on a single byte: it returns the value (0-255) of the first byte of the string. That is fine for ASCII and other single-byte encodings, but it will not help you with characters that UTF-8 encodes as more than one byte, which is everything beyond ASCII.
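To illustrate the limitation, here is a minimal sketch that compares ord with mb_ord. It assumes the mbstring extension is available (mb_ord exists from PHP 7.2) and that the string is UTF-8; the variable names are only for illustration.

```php
<?php
// "ã" as a single precomposed code point, U+00E3 (UTF-8 bytes: 0xC3 0xA3).
$char = "\u{00E3}";

// ord() only looks at the first byte of the string.
$firstByte = ord($char);             // 195 (0xC3)

// mb_ord() decodes the whole UTF-8 sequence (assumes mbstring is loaded).
$codePoint = mb_ord($char, 'UTF-8'); // 227 (0xE3)

var_dump($firstByte, $codePoint);
```

ord reports the lead byte of the UTF-8 sequence, not the character, so it cannot tell you anything about combining marks.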
How to Match Combined Diacritics
You can use preg_match with the Unicode u flag and match on the appropriate Unicode character property. In this case, \p{M} will do the job. It stands for:
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
Applied as follows:
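$a = "An\u{00E3}o";  // renders as Anão: precomposed ã (U+00E3)
$b = "Ana\u{0303}o"; // renders as Anão: a + combining tilde (U+0303)
var_dump([
    preg_match('~\p{M}~u', $a), // = 0
    preg_match('~\p{M}~u', $b), // = 1
]);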
This returns 0 and 1: your $b string contains a combining diacritical mark. So, to find out whether a string has combining diacritics, check if (preg_match('~\p{M}~u', $str)).
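The check can be wrapped in a small helper. This is only a sketch: hasCombiningMark is a hypothetical name, and it assumes the input is valid UTF-8 (with the u flag, preg_match returns false on invalid UTF-8, which this sketch treats as "no match").

```php
<?php
/**
 * Hypothetical helper: true if $str contains at least one combining mark.
 * Assumes $str is valid UTF-8; on invalid input preg_match() returns false,
 * which is reported here as "no combining mark".
 */
function hasCombiningMark(string $str): bool
{
    return preg_match('~\p{M}~u', $str) === 1;
}

$precomposed = "An\u{00E3}o";  // ã as one code point (U+00E3)
$decomposed  = "Ana\u{0303}o"; // a followed by U+0303 COMBINING TILDE

var_dump(hasCombiningMark($precomposed)); // bool(false)
var_dump(hasCombiningMark($decomposed));  // bool(true)
```

The strings are written with \u{...} escapes so that the two forms cannot be silently normalized by an editor or a copy and paste.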
This pattern matches all types of combining marks. If you want to target the exact category that the combining umlaut diacritic belongs to, use \p{Mn}:
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
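The difference between the two properties shows up with a mark that does take up space. The sketch below uses a Devanagari vowel sign, which is in the Mc (spacing combining mark) category, as a contrast; the sample strings and variable names are my own choice for illustration.

```php
<?php
// U+0303 COMBINING TILDE is a non-spacing mark (Mn).
$tilde = "a\u{0303}";

// U+093E DEVANAGARI VOWEL SIGN AA is a spacing combining mark (Mc), not Mn.
$spacing = "\u{0915}\u{093E}";

$tildeIsMn   = preg_match('~\p{Mn}~u', $tilde);   // 1
$spacingIsMn = preg_match('~\p{Mn}~u', $spacing); // 0
$spacingIsM  = preg_match('~\p{M}~u', $spacing);  // 1

var_dump($tildeIsMn, $spacingIsMn, $spacingIsM);
```

So \p{Mn} is the narrower test for accents, tildes and umlauts, while \p{M} also catches spacing and enclosing marks.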
How to Normalize Diacritics
If your question stems from "how do I make these strings equivalent", because $a != $b even though they look the same, which is obviously problematic, then normalization is what you want. PHP has a convenient Normalizer class (part of the intl extension) for converting Unicode strings to their canonical forms. It is used as follows:
Normalizer::normalize('Anão', Normalizer::NFC); // composed: ã as a single code point (default form)
Normalizer::normalize('Anão', Normalizer::NFD); // decomposed: a followed by a combining tilde
Here NFC (the default), or Normalization Form C, stands for "Canonical Decomposition, followed by Canonical Composition": the character is first split into its parts and then recomposed as far as possible, often into a single code point. Likewise NFD, or Normalization Form D, stands for "Canonical Decomposition": diacritics become separate combining characters.
If you normalized all strings that potentially contain diacritics, both in your source data and in queries made against it, I suspect your original question would not arise.
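As a rough sketch of that idea, the following compares two strings after normalizing both to NFC. It assumes the intl extension is loaded and that the inputs are valid UTF-8; sameText is a hypothetical helper name, not part of any library.

```php
<?php
// Assumes the intl extension, which provides the Normalizer class.
$a = "An\u{00E3}o";  // precomposed ã (U+00E3)
$b = "Ana\u{0303}o"; // a + combining tilde (U+0303)

var_dump($a === $b); // bool(false): same appearance, different bytes

// Hypothetical helper: compare two strings after NFC normalization.
function sameText(string $x, string $y): bool
{
    $nx = Normalizer::normalize($x, Normalizer::NFC);
    $ny = Normalizer::normalize($y, Normalizer::NFC);

    // normalize() returns false on failure (e.g. invalid UTF-8).
    return $nx !== false && $nx === $ny;
}

var_dump(sameText($a, $b)); // bool(true)

// The two forms differ in length: 5 bytes composed, 6 bytes decomposed.
var_dump(strlen(Normalizer::normalize($b, Normalizer::NFC))); // int(5)
var_dump(strlen(Normalizer::normalize($a, Normalizer::NFD))); // int(6)
```

In practice you would probably normalize once, when data is stored and when a query comes in, rather than at every comparison.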
P.S. See regular-expressions.info for a useful cheat sheet on Unicode in regular expressions, and Wikipedia for the table of Unicode character properties / general categories.