Detect incorrectly written umlauts

Question

We need to import CSV files into MySQL, and some of them contain incorrectly written umlauts.

E.g.: instead of Ü (ASCII 154), someone with a non-German keyboard entered U (ASCII 85) and added the two dots on top using ASCII 249, which looked the same to him.

MySQL writes this as U? to the DB. That's why we want PHP to detect non-ASCII character combinations like this one: a printable ASCII character followed by an extended ASCII character, a combination that does not exist in the real world, at least not in the major languages.

The preg_replace patterns we have tried either do not detect this or also flag valid umlauts.

Is there any chance of succeeding with preg_replace, or is there another way?

You may match those combinations with `preg_match_all('~\p{L}\p{M}+~u', $s, $m)`. But I doubt you can easily replace them with the corresponding wide-char Unicode letter. Perhaps you need a multibyte-to-wide-char letter mapping. – Wiktor Stribiżew, Jun 21 '17 at 12:40
When you read any text file, including CSV, you have to use the character encoding that the writer used. So, what is the encoding of the CSV file? (ASCII doesn't have a code unit or codepoint numbered 154 or 249.) Is it [IBM850](https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)#Comparison_table)? Once you get the text read in correctly, you could replace incorrect representations of each umlaut character ("U¨" with "Ü"). — Tom Blodget, Jun 21 '17 at 23:51

score 2 · Accepted Answer · edited Sep 24 '17 at 04:58

Since you want to use PHP code to detect any combination of a base letter followed by one or more diacritic symbols, you may use

if (preg_match('~\p{L}\p{M}~u', $s, $m)) {
    echo "There is a multibyte char here: " . $m[0];
}

Note that:

\p{L} - matches any Unicode letter
\p{M} - matches any diacritic symbol (a combining mark)

The u modifier enables the (*UTF) and (*UCP) PCRE flags, which make the PCRE engine treat both the string and the pattern in a Unicode-aware mode. The subject string must therefore be valid UTF-8.
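To make the difference between a valid umlaut and a "letter plus mark" sequence concrete, here is a minimal sketch. The helper name `findDecomposedLetters` is made up for illustration, and the sketch assumes the input is already valid UTF-8 and PHP 7 or later.

```php
<?php
// Hypothetical helper: returns every base letter that is followed by one or
// more combining marks. Assumes $s is valid UTF-8.
function findDecomposedLetters(string $s): array
{
    $count = preg_match_all('~\p{L}\p{M}+~u', $s, $m);
    if ($count === false) {
        // With the u modifier, matching fails when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $m[0];
}

$precomposed = "M\u{00DC}ller";   // U+00DC, a single precomposed code point
$decomposed  = "MU\u{0308}ller";  // "U" followed by U+0308 COMBINING DIAERESIS

var_dump(findDecomposedLetters($precomposed)); // no matches
var_dump(findDecomposedLetters($decomposed));  // one match: "U" + U+0308
```

The precomposed Ü is a single letter with no mark after it, so it is not reported; only the two-code-point sequence is.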

answered Jun 21 '17 at 13:08 by Wiktor Stribiżew · edited Sep 24 '17 at 04:58 by Graham

Even better! :-) – user2113177 Jun 21 '17 at 14:24

score 0 · Answer 2 · answered Jun 21 '17 at 12:53

Here's something that will potentially work:

$contents = str_replace(chr(85).chr(249),chr(154), file_get_contents("mycsv.csv"));

Then do the recommended thing: switch your DB to UTF-8 and run:

$utfText = mb_convert_encoding($contents,"UTF-8","ISO-8859-1"); //I think that's the ISO standard you are referring to
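A comment on the question suggests the file may actually be IBM850 (CP850) rather than ISO-8859-1. In CP850, byte 154 is Ü and byte 249 is the spacing diaeresis ¨ (U+00A8). The following is only a sketch under that assumption; it also assumes the mbstring extension is available. Note that U+00A8 is a symbol, not a combining mark, so `\p{M}` alone would not catch it, which is why the pattern lists it explicitly.

```php
<?php
// Assumption: the CSV bytes are IBM850 (CP850); adjust if the real encoding differs.
// $raw stands in for file_get_contents("mycsv.csv").
$raw = "M" . chr(85) . chr(249) . "ller;M" . chr(154) . "ller";

// Decode to UTF-8 first so the u modifier can be used.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'CP850');

// A letter followed by a combining mark or by the spacing diaeresis U+00A8.
$hasFakeUmlaut = preg_match('~\p{L}[\p{M}\x{A8}]~u', $utf8, $m) === 1;

if ($hasFakeUmlaut) {
    echo "Suspicious sequence found: " . $m[0] . PHP_EOL;
}
```

Only the "U" plus ¨ pair is flagged here; the real Ü (byte 154) decodes to a single letter and passes.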

answered Jun 21 '17 at 12:53 by apokryfos

Thanks. But the only thing we need is validation: just warn the user if non-standard characters are detected. – user2113177 Jun 23 '17 at 13:43
`strpos(file_get_contents("mycsv.csv"),chr(85).chr(249)) !== false` would return `true` if the string contains character 85 followed by character 249. However, my UTF-8 conversion suggestion still stands, since the DB currently seems to use a character set that doesn't work with the data you're giving it. – apokryfos Jun 23 '17 at 13:58

user2113177 · Answer 3 · 2017-06-21T14:00:11.093

Wiktor (first comment) nailed it.

We don't need to replace, just a warning is fine for us, since it is a rare case that should be fixed in the CSV file anyway.

'~\p{L}\p{M}+~u'

does the job.
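For the warn-only use case, the pattern could be applied line by line so the user sees where to fix the CSV. This is a rough sketch: the function name `findSuspiciousLines` and the sample rows are invented for illustration, and the lines are assumed to be valid UTF-8 already.

```php
<?php
// Hypothetical validator: returns one warning per line that contains a letter
// followed by combining marks. Lines that are not valid UTF-8 make
// preg_match_all() return false and are skipped here without a warning.
function findSuspiciousLines(array $lines): array
{
    $warnings = [];
    foreach ($lines as $index => $line) {
        if (preg_match_all('~\p{L}\p{M}+~u', $line, $m) > 0) {
            $warnings[] = sprintf('Line %d: %s', $index + 1, implode(', ', $m[0]));
        }
    }
    return $warnings;
}

// In practice $lines could come from file('mycsv.csv', FILE_IGNORE_NEW_LINES).
$lines = [
    "id;name",
    "1;M\u{00FC}ller",   // precomposed ü, not reported
    "2;MU\u{0308}LLER",  // U + combining diaeresis, reported
];

foreach (findSuspiciousLines($lines) as $warning) {
    echo $warning, PHP_EOL;
}
```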

answered Jun 21 '17 at 12:58 by user2113177 · edited Jun 21 '17 at 14:00

If it works for you, I can post a full answer with explanations myself. "Does the job" is not really a helpful type of answer. See my answer below. – Wiktor Stribiżew Jun 21 '17 at 13:05
