I have these two example strings:

$a = 'Anão'; $b = 'Anão';

They visually look the same, but the 3rd character is different:

In string $a it is U+00E3 (decimal 227, Latin small letter a with tilde), while in string $b it is U+0061 (decimal 97, Latin small letter a) followed by U+0303 (decimal 771, combining tilde).

How can I detect whether a string contains any combining characters, rather than the precomposed ("regular") ones?

I have tried checking every character of the string with the ord() function, but it didn't work.

  • If I use `json_encode()` they look both the same. See: https://3v4l.org/uqObJ But perhaps that is one way to distinguish them? Not exactly sure why you want to do this? – KIKO Software Dec 09 '22 at 13:01
  • Could you show us your `ord()` code? It should work. Note: per Unicode you should handle both cases in the same manner. Usually we normalize the input (choosing one of the two canonical normalization forms) [note: there can be other equivalent but non-canonical forms] – Giacomo Catenazzi Dec 09 '22 at 13:11
  • See [this PHP demo at tio.run](https://tio.run/##pZLBasJAEIbv@xRjEDQoScTSgzYtWxtqIEapkR5qD2uyaYJ2E7IroYgnH6WP4tV3SjcGoaAUobd/Zv9/PmaZNEqL4u5hMpwgpOsQB5SJOPwCPyIZ8QXNeK/sR0KkvKfreZ5rC7KgKy4SRjU/0dZLfcZiPwmonkdExDwWWiQ@VwjVucg4mPDWwGz/nTTaIAU57Crl7nfjSuCDVO99hMIko8SPmlUQT6G@BPMeylJFGwRA/SgBZSqymH2AAho068tWR5VCKY1KX3risCn9acKPY9qgkPl6Y3SN7lZRoWaaEJIVp6p0nubNWoZx2wEHe7YL0xF2HHAsz7NeAANhAch3GYfBePRou7b7DJ7tPFlzdsTJn6DnTHwN8@bEHOCJ7eH/U0uoYf0NtbqXF321veG1iMGViLO9LkJ@pV15UxBQQeXhBbWjZ4uK4gc). [Here is a nice tool](https://www.babelstone.co.uk/Unicode/whatisit.html) to identify Unicode characters. – bobble bubble Dec 09 '22 at 14:18
  • @KIKOSoftware one possible use case is when you get content from different sources with mixed practice, and you want to e.g. normalize everything to "single characters only". Mixed diacritics can lead to trouble when querying and matching data. – Markus AO Dec 09 '22 at 14:59
  • Does this answer your question? [Encoding issues lead to 2 folders created with same name in the same location](https://stackoverflow.com/questions/69790463/encoding-issues-lead-to-2-folders-created-with-same-name-in-the-same-location) – JosefZ Dec 09 '22 at 15:31
  • @JosefZ yes that's the perfect follow-up in case OP wants to normalize the data to only use a single diacritics normalization convention [NFD vs NFC](https://unicode.org/reports/tr15/#Norm_Forms). Which is a good idea, because when `$a != $b` although they look the same, headache follows. – Markus AO Dec 09 '22 at 16:16

2 Answers

Be aware that ord() works on bytes, not characters: it returns the value (0-255) of the first byte of the string it is given. In UTF-8, every character outside the ASCII range is encoded as several bytes, so ord() on its own will not identify those characters.
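To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes PHP 7.4+ with the mbstring extension, and it builds the two strings with explicit `\u{...}` escapes so the bytes are unambiguous. The variable names are only for illustration.

```php
<?php
// Both strings render as "Anão" but are encoded differently.
$a = "An\u{00E3}o";   // precomposed: U+00E3
$b = "Ana\u{0303}o";  // decomposed: U+0061 followed by U+0303

// Byte view: this is what ord() sees, one byte at a time.
$bytesA = array_values(unpack('C*', $a));
$bytesB = array_values(unpack('C*', $b));

// Code point view: mb_ord() decodes a whole UTF-8 character.
$toCodePoint = fn(string $ch): int => mb_ord($ch, 'UTF-8');
$codePointsA = array_map($toCodePoint, mb_str_split($a, 1, 'UTF-8'));
$codePointsB = array_map($toCodePoint, mb_str_split($b, 1, 'UTF-8'));

echo implode(' ', $bytesA), PHP_EOL;      // 65 110 195 163 111
echo implode(' ', $bytesB), PHP_EOL;      // 65 110 97 204 131 111
echo implode(' ', $codePointsA), PHP_EOL; // 65 110 227 111
echo implode(' ', $codePointsB), PHP_EOL; // 65 110 97 771 111
```

The byte view never shows 227 or 771, which is probably why a loop over ord() appeared not to work. The code point view shows the extra U+0303 (771) directly.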

How to Match Combined Diacritics

You can use preg_match with the Unicode u flag and match on a Unicode character property. In this case, \p{M} will do the job. It stands for:

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

Applied as follows:

$a = 'Anão';
$b = 'Anão';

var_dump([
    preg_match('~\p{M}~u', $a), // = 0
    preg_match('~\p{M}~u', $b) // = 1
]);

This returns 0 and 1: only your $b string has a combining diacritical mark. To find out whether a string has combining diacritics, you would check if (preg_match('~\p{M}~u', $str)).

This would match all types of combining marks. If you wanted to target the exact category the combining tilde belongs to, that would be \p{Mn}:

\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
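If you also want to know which marks were found, the same pattern works with preg_match_all. The sketch below assumes PHP 7.4+ with mbstring. `findCombiningMarks` is a hypothetical helper name, not a built-in function.

```php
<?php
/**
 * Hypothetical helper: returns the code point ("U+XXXX") of every
 * combining mark (\p{M}) found in a UTF-8 string.
 */
function findCombiningMarks(string $str): array
{
    // With the u flag, preg_match_all() returns false on invalid UTF-8.
    if (preg_match_all('~\p{M}~u', $str, $matches) === false) {
        throw new InvalidArgumentException('preg_match_all failed (possibly invalid UTF-8).');
    }

    return array_map(
        fn(string $mark): string => sprintf('U+%04X', mb_ord($mark, 'UTF-8')),
        $matches[0]
    );
}

$a = "An\u{00E3}o";   // precomposed ã
$b = "Ana\u{0303}o";  // a + combining tilde

var_dump(findCombiningMarks($a)); // empty array
var_dump(findCombiningMarks($b)); // one entry: "U+0303"
```

Swapping `\p{M}` for `\p{Mn}` in the pattern would restrict the result to non-spacing marks only.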

How to Normalize Diacritics

If your question really is "how do I make these strings equivalent" (having $a != $b for strings that look the same is obviously problematic), PHP has a convenient Normalizer class, part of the intl extension, for converting Unicode strings to their canonical forms. It is used as follows:

Normalizer::normalize('Anão', Normalizer::NFC); // Single Char, Default
Normalizer::normalize('Anão', Normalizer::NFD); // Combined

Here, NFC (the default), or Normalization Form C, stands for "Canonical Decomposition, followed by Canonical Composition": each character is first split into its parts and then recomposed as far as possible, often into a single character. NFD, or Normalization Form D, stands for "Canonical Decomposition": diacritics become separate combining characters.

If you normalized all strings that potentially contain diacritics, both in your source data and in queries made against it, I suspect your original question would not arise.
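As a rough sketch of what that looks like in practice, the following compares the two strings before and after normalization. It assumes the intl extension is loaded, and it uses explicit `\u{...}` escapes so the two encodings are unambiguous.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.
$a = "An\u{00E3}o";   // precomposed (already NFC)
$b = "Ana\u{0303}o";  // decomposed (NFD)

// Raw comparison: the byte sequences differ.
$rawEqual = ($a === $b);

// Compare after bringing both strings to the same form.
$nfcEqual = Normalizer::normalize($a, Normalizer::NFC)
    === Normalizer::normalize($b, Normalizer::NFC);

// isNormalized() tells you whether a string is already in a given form.
$aIsNfc = Normalizer::isNormalized($a, Normalizer::NFC);
$bIsNfc = Normalizer::isNormalized($b, Normalizer::NFC);

var_dump($rawEqual); // bool(false)
var_dump($nfcEqual); // bool(true)
var_dump($aIsNfc);   // bool(true)
var_dump($bIsNfc);   // bool(false)
```

Note that NFC does not guarantee a string free of combining marks: a base letter and mark with no precomposed equivalent stay as two code points. The \p{M} check and normalization therefore answer slightly different questions.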


P.S. See regular-expressions.info for a useful reference on Unicode in regular expressions, and the Unicode character property article on Wikipedia for a table of the general categories.

Markus AO

You can do a bunch of comparisons to check the equality.

$a = 'Anão';
$b = 'Anão';

$c = iconv('UTF-8', 'ASCII//TRANSLIT', $a);
$d = iconv('UTF-8', 'ASCII//TRANSLIT', $b);

echo ($c === $d ? 'same meaning' : 'different meaning'), PHP_EOL;
echo ($a === $b ? 'same string'  : 'different string'), PHP_EOL;
echo ($a === $c ? 'a has no encoded characters' : 'a has encoded characters'), PHP_EOL;
echo ($b === $d ? 'b has no encoded characters' : 'b has encoded characters'), PHP_EOL;

Output

same meaning
different string
a has encoded characters
b has encoded characters
Markus Zeller
  • `$d = iconv('UTF-8', 'ASCII//TRANSLIT', $b);` curiously throws _iconv(): Detected an illegal character in input string_ (`$b`) for me, not sure what's going on there (PHP 8.1.5, Windows), works fine in [3v4l](https://3v4l.org/9u6NE). How do these comparisons help OP tell if `$a` or `$b` has a combining tilde, though? – Markus AO Dec 09 '22 at 15:11
  • My answer may not be 100% what OP asks, but helps detecting "unnormal" chars. So any Umlaut is being detected. I don't see any error running PHP 8.2 WSL2 Ubuntu 22.02 – Markus Zeller Dec 09 '22 at 15:35
  • Yes it's definitely a useful answer. PHP for Windows and Linux possibly still use different iconv libraries, that'd explain the error on Windows. I bump into it every so often when I try to use iconv on anything halfway exotic, and have to add `//IGNORE` to get somewhere. (I typically end up finding another solution, I don't like "ignore" and the possible resulting loss of data.) – Markus AO Dec 09 '22 at 16:06