Situation
I'm importing huge JSON files into a database. They contain fields that users filled in with an online WYSIWYG editor, which also let them paste in special characters, typically copied from an MS Word document.
Problem
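After decoding the JSON file, a couple of special characters are missing from the output. It turns out most of them arrive as Unicode control characters, for example † which is encoded as U+0086. (U+0086 is a C1 control character; 0x86 is the byte for † in Windows-1252, whereas the actual dagger is U+2020.)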
Example
<?php
$json = '{"test": "start \u0086 end"}';
$decoded = json_decode($json);
echo $decoded->test . PHP_EOL;
Output:
start end
Expected output:
start † end
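It may help to confirm whether the character is really dropped or just invisible. A minimal sketch, reusing the example above and relying only on core PHP functions, dumps the decoded bytes:

```php
<?php
// Same input as the example above; single quotes keep \u0086 as literal JSON text.
$json = '{"test": "start \u0086 end"}';
$decoded = json_decode($json);

// Show the raw bytes of the decoded string instead of printing it.
$hex = bin2hex($decoded->test);
echo $hex . PHP_EOL;                   // 737461727420c28620656e64
echo strlen($decoded->test) . PHP_EOL; // 12
```

The bytes c2 86 in the middle are the UTF-8 encoding of U+0086, so json_decode() keeps the code point. It is a non-printing C1 control character, which would explain why it seems to vanish when echoed.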
Temporary fix
For the moment I applied this dirty fix, but I'm still looking for a more elegant way to replace all of these Unicode escape sequences.
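protected static function replaceUnicodeCharacters(&$string)
{
    // Note: in PHP, "\u0086" (without braces) is the literal six-character
    // text \u0086, not a code point. These keys therefore match escape
    // sequences in the raw JSON text, so this is meant to run before json_decode().
    $replace = [
        "\u0086" => "†",
        "\u00b0" => "°",
        "\u0093" => "“",
        "\u0094" => "”",
        "\u0091" => "‘",
        "\u0092" => "’",
        "\u009c" => "œ",
        "\u00f6" => "ö",
        "\u00f9" => "ù",
        "\u00ad" => "", // soft hyphen: removed
        "\u0096" => "–",
        "\u00fb" => "û",
        "\u00a0" => " ", // no-break space
        "\u0085" => "…",
        "\u00ab" => "«",
        "\u00bb" => "»",
        "\u008c" => "Œ",
        "\u00c0" => "À",
        "\u00ff" => "ÿ",
        "\u00fc" => "ü",
    ];
    // Case-insensitive, so \u00AB and \u00ab are both matched.
    $string = str_ireplace(array_keys($replace), array_values($replace), $string);
}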
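The hand-written list above covers only the characters seen so far. One possible generalization, sketched here under the assumption that every code point in U+0080..U+009F was originally a Windows-1252 byte, is to decode first and then remap that whole range in one pass. The function name `cp1252ControlsToUnicode` is hypothetical and not part of the original code; the `"\u{...}"` escapes require PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original post): maps C1 control code points
// U+0080..U+009F to the characters Windows-1252 assigns to the same byte values.
// The five bytes Windows-1252 leaves undefined (81, 8D, 8F, 90, 9D) are left untouched.
function cp1252ControlsToUnicode(string $text): string
{
    static $map = [
        "\u{0080}" => "\u{20AC}", // euro sign
        "\u{0082}" => "\u{201A}", // single low-9 quotation mark
        "\u{0083}" => "\u{0192}", // f with hook
        "\u{0084}" => "\u{201E}", // double low-9 quotation mark
        "\u{0085}" => "\u{2026}", // horizontal ellipsis
        "\u{0086}" => "\u{2020}", // dagger
        "\u{0087}" => "\u{2021}", // double dagger
        "\u{0088}" => "\u{02C6}", // modifier circumflex
        "\u{0089}" => "\u{2030}", // per mille sign
        "\u{008A}" => "\u{0160}", // S with caron
        "\u{008B}" => "\u{2039}", // single left angle quotation mark
        "\u{008C}" => "\u{0152}", // OE ligature
        "\u{008E}" => "\u{017D}", // Z with caron
        "\u{0091}" => "\u{2018}", // left single quotation mark
        "\u{0092}" => "\u{2019}", // right single quotation mark
        "\u{0093}" => "\u{201C}", // left double quotation mark
        "\u{0094}" => "\u{201D}", // right double quotation mark
        "\u{0095}" => "\u{2022}", // bullet
        "\u{0096}" => "\u{2013}", // en dash
        "\u{0097}" => "\u{2014}", // em dash
        "\u{0098}" => "\u{02DC}", // small tilde
        "\u{0099}" => "\u{2122}", // trade mark sign
        "\u{009A}" => "\u{0161}", // s with caron
        "\u{009B}" => "\u{203A}", // single right angle quotation mark
        "\u{009C}" => "\u{0153}", // oe ligature
        "\u{009E}" => "\u{017E}", // z with caron
        "\u{009F}" => "\u{0178}", // Y with diaeresis
    ];

    // strtr() with an array replaces every key in a single pass.
    return strtr($text, $map);
}

// Usage: decode first, then repair the decoded string.
$decoded = json_decode('{"test": "start \u0086 end"}');
$fixed = cp1252ControlsToUnicode($decoded->test);
echo $fixed . PHP_EOL; // start † end
```

Entries such as \u00b0 or \u00f6 in the list above are ordinary Latin-1 code points that json_decode() already decodes to ° and ö, so they should not need special handling. Removing the soft hyphen (U+00AD) and replacing the no-break space (U+00A0) are separate cleanup choices that this sketch does not perform.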