Difficulties unescaping Unicode sequences

Question

I have a few large JSON files from a source I don't control that I'm trying to clean up in Notepad++ before using them as program input.

The file contains a lot of unicode sequences, which I unfortunately know very little about. It's the type using two or three sequences to represent one character, such as \u00c3\u00a9 for é, and \u00e2\u0080\u0094 for an em dash (—).

I've spent all night Googling how to convert these back into normal characters, but unfortunately I don't understand much of what I came across.

I did eventually figure out that by installing the HTML Tag plugin, I can use "Decode JS" on them, then convert the whole file to ANSI and then represent it as UTF-8, which fixes the issue with most of the characters.

But some, such as the em dash or Ç (\u00c3\u0087), still refuse to be converted.

Can someone please point me toward why these particular characters still display incorrectly, and how I can fix it? Thanks.

score 2 · Answer 1 · answered May 22 '18 at 16:30

The JSON was written incorrectly to begin with. The string data probably written to a database configured for storing latin1 data, but written encoded as UTF-8, then read back as latin1 data.

If you use a JSON library to read in the JSON, the strings in the data will need to be encoded as latin1 to reverse the error, then decoded as UTF-8 to interpret it properly.

Here's an example in Python 3:

#!coding:utf8
import json

raw = '"\u00c3\u00a9\u00e2\u0080\u0094\u00c3\u0087"' # Your é—Ç examples.
data = json.loads(raw)
print(data) # garbage
print(data.encode('latin1').decode('utf8')) # corrected

Output:

Ã©âÃ
é—Ç

score 0 · Answer 2 · answered May 22 '18 at 14:04

You can just import the files into the JavaScript program that requires the JSON data, parse the JSON files, and then pass the result into the decodeURIComponent method. In the following code snippet, I have a mini-JSON string, which I then parse, but you can replace the value of the json variable with your file.

    var json = `{"data" : "\u0024 equals the Dollar sign"}`
    var res = JSON.parse(json)
    console.log(res)
    var result = decodeURIComponent(res["data"]);
console.log(result)

However, I cannot recognise the "type" of the Unicode sequences you provide, such as the escape sequence for the em dash. If you could provide more information in your question about the type of Unicode escape sequences within the files, it would be appreciated.

Difficulties unescaping Unicode sequences

2 Answers2