Decode unicode control characters in JSON string

Question

Situation

I'm importing huge JSON files into a database. They contain fields that were filled in by users through an online WYSIWYG editor, which also allowed them to paste in special characters, typically copied from an MS Word document.

Problem

After decoding the JSON file, a couple of special characters are missing from the output. It turns out most of them are Unicode control characters, for example †, which arrives as \u0086 (U+0086).

Example

<?php
$json = '{"test": "start \u0086 end"}';
$decoded = json_decode($json);
echo $decoded->test . PHP_EOL;

Output:

start  end

Expected output:

start † end
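
One way to confirm what json_decode() actually produced is to dump the bytes of the decoded string. A minimal sketch, assuming the same input as in the example (the variable name $hex is only for illustration):

```php
<?php
$json = '{"test": "start \u0086 end"}';
$decoded = json_decode($json);

// U+0086 survives decoding; in UTF-8 it is encoded as the two bytes C2 86.
$hex = bin2hex($decoded->test);
echo $hex . PHP_EOL; // 737461727420c28620656e64
```

The character is still present in the decoded value as a C1 control code point, which is consistent with the seemingly empty gap in the output.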

Temporary fix

For the moment I applied this dirty fix, but I'm still looking for a more elegant way to replace all of these Unicode characters.

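// Note: "\u0086" in a double-quoted PHP string is the literal six characters \u0086,
// so this only matches the escape sequences as they appear in raw, undecoded JSON text.
protected static function replaceUnicodeCharacters(&$string)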
{
    $replace = [
        "\u0086" => "†",
        "\u00b0" => "°",
        "\u0093" => "“",
        "\u0094" => "”",
        "\u0091" => "‘",
        "\u0092" => "’",
        "\u009c" => "œ",
        "\u00f6" => "ö",
        "\u00f9" => "ù",
        "\u00ad" => "",
        "\u0096" => "–",
        "\u00fb" => "û",
        "\u00a0" => " ",
        "\u0085" => "…",
        "\u00ab" => "«",
        "\u00bb" => "»",
        "\u008c" => "Œ",
        "\u00c0" => "À",
        "\u00ff" => "ÿ",
        "\u00fc" => "ü",
    ];

    $string = str_ireplace(array_keys($replace), array_values($replace), $string);
}
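
A variant of the same idea that works on the decoded string rather than on the raw JSON text might look like the sketch below. It assumes the stray code points U+0080 to U+009F are really Windows-1252 bytes that were mislabelled as Unicode, as with † in the example. The function name fixC1Controls is hypothetical, and the table covers only the 27 byte values that Windows-1252 defines in that range.

```php
<?php
// Hypothetical helper (not from the thread): replaces C1 control code points
// with the characters Windows-1252 assigns to the same byte values.
function fixC1Controls(string $text): string
{
    static $map = [
        "\u{0080}" => "\u{20AC}", // euro sign
        "\u{0082}" => "\u{201A}", // single low-9 quotation mark
        "\u{0083}" => "\u{0192}", // f with hook
        "\u{0084}" => "\u{201E}", // double low-9 quotation mark
        "\u{0085}" => "\u{2026}", // horizontal ellipsis
        "\u{0086}" => "\u{2020}", // dagger
        "\u{0087}" => "\u{2021}", // double dagger
        "\u{0088}" => "\u{02C6}", // modifier circumflex
        "\u{0089}" => "\u{2030}", // per mille sign
        "\u{008A}" => "\u{0160}", // S with caron
        "\u{008B}" => "\u{2039}", // single left angle quotation mark
        "\u{008C}" => "\u{0152}", // ligature OE
        "\u{008E}" => "\u{017D}", // Z with caron
        "\u{0091}" => "\u{2018}", // left single quotation mark
        "\u{0092}" => "\u{2019}", // right single quotation mark
        "\u{0093}" => "\u{201C}", // left double quotation mark
        "\u{0094}" => "\u{201D}", // right double quotation mark
        "\u{0095}" => "\u{2022}", // bullet
        "\u{0096}" => "\u{2013}", // en dash
        "\u{0097}" => "\u{2014}", // em dash
        "\u{0098}" => "\u{02DC}", // small tilde
        "\u{0099}" => "\u{2122}", // trade mark sign
        "\u{009A}" => "\u{0161}", // s with caron
        "\u{009B}" => "\u{203A}", // single right angle quotation mark
        "\u{009C}" => "\u{0153}", // ligature oe
        "\u{009E}" => "\u{017E}", // z with caron
        "\u{009F}" => "\u{0178}", // Y with diaeresis
    ];

    // strtr() with an array replaces every key with its value in one pass.
    return strtr($text, $map);
}

$decoded = json_decode('{"test": "start \u0086 end"}');
echo fixC1Controls($decoded->test) . PHP_EOL; // start † end
```

Characters from U+00A0 upwards, such as ° or ö, are left alone because they already decode to the intended character.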

Not one of my test browsers manages to display the “UTF-8” version of this character on https://www.fileformat.info/info/unicode/char/0086/browsertest.htm correctly; only the decimal/hex HTML escape seems to work properly. – misorude Aug 01 '19 at 12:50

score 0 · Answer 1 · answered Aug 01 '19 at 13:05

0x86 when interpreted as Windows-1252 is †. You're just missing an encoding step:

$decoded->test = mb_convert_encoding($decoded->test, "Windows-1252", "UTF-8");
echo '<html><meta charset="Windows-1252">';
echo $decoded->test . PHP_EOL;
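
The relationship between the byte and the character can be checked directly. A small sketch, assuming the mbstring extension is available (the snippet above already relies on it); the variable name $dagger is only for illustration:

```php
<?php
// Interpret the single byte 0x86 as Windows-1252 and convert it to UTF-8.
$dagger = mb_convert_encoding("\x86", 'UTF-8', 'Windows-1252');

// U+2020 is the dagger character (†).
var_dump($dagger === "\u{2020}");
```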

daxim

For me this results in `start � end`. Echoing the charset is not possible as I'm using a console application. – Bram Verstraten Aug 01 '19 at 13:18
Then set the console's encoding appropriately. Looks like you have configured it to display UTF-8, but what I'm trying to get across here is that your data is in Windows-1252. – daxim Aug 01 '19 at 13:33

ya_Bob_Jonez · Answer 2 · 2019-12-04T11:31:54.970

-1

EDIT: PHP Unicode in JSON

I hope that helps at least a little...

edited Dec 04 '19 at 11:31

answered Aug 01 '19 at 12:54

This yields NULL on PHP 7.3, and `Uncaught JsonException: Syntax error in php shell code:1` with `JSON_THROW_ON_ERROR` enabled – Maxime Launois Aug 01 '19 at 13:02
Okay, I'm sorry. Just tried to help. :( This is because this character is not supported by most of the browsers, maybe. PHP can't display the cross properly D: I guess. – ya_Bob_Jonez Aug 01 '19 at 14:25
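
The two failure modes mentioned in the comments above can be reproduced with any malformed input. A sketch, assuming PHP 7.3 or later; the input string here is a made-up invalid document, not the code the comment refers to, which is not shown in the thread:

```php
<?php
$invalid = '{"test": }'; // hypothetical malformed JSON

// Default behaviour: json_decode() returns NULL and records the error.
$result = json_decode($invalid);
var_dump($result);
echo json_last_error_msg() . PHP_EOL;

// With JSON_THROW_ON_ERROR a JsonException is thrown instead.
$thrown = false;
try {
    json_decode($invalid, false, 512, JSON_THROW_ON_ERROR);
} catch (JsonException $e) {
    $thrown = true;
    echo get_class($e) . ': ' . $e->getMessage() . PHP_EOL;
}
```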
