I downloaded JSON data from Instagram, which I'm parsing in Node.js and storing in MongoDB. The escaped Unicode characters in the data do not render as the correct emoji on the client side.

For instance, here's a property from one of the JSON files I'm parsing and storing:

"title": "@mujenspirits is in the house!NEW York City \u00f0\u009f\u0097\u00bd\u00f0\u009f\u008d\u008e \nImperial Vintner Liquor Store"

The above example should display like this:

@mujenspirits is in the house!NEW York City 🗽🍎 Imperial Vintner Liquor Store

But instead looks like this:

@mujenspirits is in the house!NEW York City ðŸ—½ðŸŽ Imperial Vintner Liquor Store

I found another SO question where someone had a similar problem and their solution works for me in the console using a simple string, but when used with JSON.parse still gives the same incorrect display. This is what I'm using now to parse the JSON files.

export default function parseJsonFile(filepath: string) {
  const value = fs.readFileSync(filepath)
  const converted = new Uint8Array(
    new Uint8Array(Array.prototype.map.call(value, (c) => c.charCodeAt(0)))
  )
  return JSON.parse(new TextDecoder().decode(converted))
}

For posterity, I found an additional SO question similar to mine. There wasn't a solution; however, one of the comments said:

The JSON files were generated incorrectly. The strings represent Unicode code points as escape codes, but are UTF-8 data decoded as Latin1

The commenter suggested encoding the loaded JSON to latin1 then decoding to utf8, but this didn't work for me either.

import buffer from 'buffer'

const value = fs.readFileSync(filepath)
const buffered = buffer.transcode(value, 'latin1', 'utf8')
return JSON.parse(buffered.toString())

I know pretty much nothing about character encoding, so at this point I'm shooting in the dark searching for a solution.

bflemi3
  • The comment in the other SO question (it was mine) is correct in this case as well. The Unicode characters in the string are the ones you see printed, and not the emoji you want due to being misdecoded as Latin-1. For example, in Python to reverse the problem `'\u00f0\u009f\u0097\u00bd\u00f0\u009f\u008d\u008e'.encode('latin1').decode('utf8')` produces `'🗽🍎'` – Mark Tolonen Jul 13 '22 at 20:11
  • That was my upvote on your comment :). I tried (in node) doing just that but didn't have any luck. Like I said though, I know pretty much nothing about character encoding so I may have done something wrong. – bflemi3 Jul 13 '22 at 20:16
  • How did you get the string in the first place? It looks like the file was read using a Latin-1 codec instead of a UTF-8 codec to get that Unicode string result. – Mark Tolonen Jul 13 '22 at 20:21
  • @MarkTolonen I updated my question to include my encode/decode attempt. – bflemi3 Jul 13 '22 at 20:23
  • From googling try: `fs.readFileSync(filepath, 'utf8');` – Mark Tolonen Jul 13 '22 at 20:25
  • Nope, didn't work, same result. – bflemi3 Jul 13 '22 at 20:33
  • I also see it written like `fs.readFileSync('./input2.txt', {encoding:'utf8', flag:'r'});`. If that doesn't work the data received could be what is encoded incorrectly in the first place. How was it downloaded from Instagram? – Mark Tolonen Jul 13 '22 at 20:38
  • Directly from instagram.com, not programmatically. The user logged into their account and went to Settings > Privacy and security > Data Download and chose JSON as the format. – bflemi3 Jul 13 '22 at 20:41
  • The zip file they received was then uploaded to S3 via our application. Then a lambda running node v14 is extracting the data from the zip, processing it and inserting into Mongo. – bflemi3 Jul 13 '22 at 20:56
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246419/discussion-between-mark-tolonen-and-bflemi3). – Mark Tolonen Jul 13 '22 at 20:57

2 Answers

An easy solution is to decode the string with the `utf8` package:

npm install utf8

As a usage example, here is code that uses Node.js and Express:

import express from "express";
import utf8 from "utf8";

const app = express();
app.get("/", (req, res) => {
  const text = "\u00f0\u009f\u0097\u00bd\u00f0\u009f\u008d\u008e it is a test";
  const textDecode = utf8.decode(text);

  console.log(textDecode);
  res.send(textDecode);
});

const port = process.env.PORT || 5000;
app.listen(port, () => {
  console.log(`Server on port ${port}`);
});

Opening localhost:5000 then shows the emoji correctly. You can apply the same idea in your project to repair the emoji in your JSON.
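One way to apply the idea to a JSON file is sketched below. It swaps the `utf8` package for Node's built-in `Buffer` (the `latin1` encoding writes one byte per character), and it does the repair inside a `JSON.parse` reviver, so it only runs after the `\u00XX` escapes have been turned into characters. The names `fixMojibake` and `parseInstagramJson` are hypothetical, and the sketch assumes every string in the file is affected in the same way. Object keys are not touched.

```javascript
// Sketch only: fixMojibake and parseInstagramJson are hypothetical helper names.

// Write each character as one Latin-1 byte, then read those bytes back as UTF-8.
function fixMojibake(str) {
  return Buffer.from(str, "latin1").toString("utf8");
}

// The reviver is called for every parsed value, after JSON.parse has already
// converted each \u00XX escape into a single character.
function parseInstagramJson(jsonText) {
  return JSON.parse(jsonText, (key, value) =>
    typeof value === "string" ? fixMojibake(value) : value
  );
}

// The doubled backslashes keep the escapes as literal text, as they appear in the file.
const raw =
  '{"title": "NEW York City \\u00f0\\u009f\\u0097\\u00bd\\u00f0\\u009f\\u008d\\u008e"}';
console.log(parseInstagramJson(raw).title);
```

Running the repair on the raw file text before `JSON.parse` would not help, because at that point the escapes are still plain ASCII text.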

And here is an example from the client side:

<script src="https://cdnjs.cloudflare.com/ajax/libs/utf8/2.1.1/utf8.min.js" integrity="sha512-PACCEofNpYYWg8lplUjhaMMq06f4g6Hodz0DlADi+WeZljRxYY7NJAn46O5lBZz/rkDWivph/2WEgJQEVWrJ6Q==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>

<p id="text"></p>

<script>
  const element = document.getElementById("text");
  const txt = "\u00f0\u009f\u0097\u00bd\u00f0\u009f\u008d\u008e it is a test";
  const text = utf8.decode(txt);
  console.log(text);
  element.innerHTML = text;
</script>
Usiel
  • Thanks for your reply. I haven't tried using the `utf8` package yet. I won't have time to verify this until Monday. – bflemi3 Jul 22 '22 at 11:23
  • This didn't work for me, still having the same issue. Your example over-simplified my issue, leaving out `readFileSync(filepath, 'utf8')`. When I passed the string returned from `readFileSync` to `utf8` then did `JSON.parse` I still had the same result. – bflemi3 Jul 24 '22 at 13:29
You can try converting the Unicode escape sequences to bytes before parsing the JSON; the utf8.js library can probably help you with that.
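A rough sketch of that first option, using Node's built-in `Buffer` rather than utf8.js. It rests on assumptions that are not confirmed by the question: that the raw file is ASCII-only JSON, and that every non-ASCII character was written as its UTF-8 bytes, each escaped separately as `\u0080` to `\u00ff`. The function names are hypothetical.

```javascript
// Sketch only: unescapeUtf8Bytes and parseAfterUnescaping are hypothetical names.
// Assumes ASCII-only input where \u0080..\u00ff escapes each stand for one UTF-8 byte.
function unescapeUtf8Bytes(jsonText) {
  // Match an escaped backslash first so that "\\u00f0" (literal text) is left alone.
  const withRawBytes = jsonText.replace(
    /(\\\\)|\\u00([89a-fA-F][0-9a-fA-F])/g,
    (match, escapedBackslash, hex) =>
      escapedBackslash ? match : String.fromCharCode(parseInt(hex, 16))
  );
  // Every character is now at most U+00FF: write one byte each, then read as UTF-8.
  return Buffer.from(withRawBytes, "latin1").toString("utf8");
}

function parseAfterUnescaping(jsonText) {
  return JSON.parse(unescapeUtf8Bytes(jsonText));
}

const sample =
  '{"title": "NYC \\u00f0\\u009f\\u0097\\u00bd", "note": "a\\\\u00f0"}';
const parsed = parseAfterUnescaping(sample);
console.log(parsed.title);
```

Only escapes from `\u0080` upward are rewritten, so escapes such as `\u0022` that are needed to keep the JSON valid stay untouched. If the file contains characters above U+00FF, this approach would corrupt them.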

Alternatively, the solution you found should work, but only after deserializing the JSON (parsing turns each Unicode escape sequence into one character). So you need to traverse the parsed object and apply the solution to each string.

For example:

import fs from "fs";

function parseJsonFile(filepath) {
  // Parse first so that every \u00XX escape becomes a single character.
  const value = fs.readFileSync(filepath, "utf8");
  return decodeUTF8(JSON.parse(value));
}

function decodeUTF8(data) {
  if (typeof data === "string") {
    // Treat each character code as one byte, then decode the bytes as UTF-8.
    const utf8 = new Uint8Array(
      Array.prototype.map.call(data, (c) => c.charCodeAt(0))
    );
    return new TextDecoder("utf-8").decode(utf8);
  }

  if (Array.isArray(data)) {
    return data.map(decodeUTF8);
  }

  // typeof null is "object", so exclude it before iterating.
  if (data !== null && typeof data === "object") {
    const obj = {};
    Object.entries(data).forEach(([key, value]) => {
      obj[key] = decodeUTF8(value);
    });
    return obj;
  }

  return data;
}
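One caveat with the string branch above: it rewrites every string, including ones that were already correct. A character above U+00FF is truncated to a single byte, and a genuine Latin-1 character such as "é" decodes to the replacement character U+FFFD. A more defensive variant is sketched below. The name `decodeIfMojibake` is hypothetical, and it relies on `TextDecoder`'s `fatal` option, which Node supports only when built with ICU (the default builds are).

```javascript
// Sketch only: decodeIfMojibake is a hypothetical helper name.
// Returns the repaired string, or the input unchanged if it does not look like
// UTF-8 bytes stored one per character.
function decodeIfMojibake(str) {
  // A character above U+00FF cannot be a single byte, so leave such strings alone.
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return str;
  }
  const bytes = Uint8Array.from(str, (c) => c.charCodeAt(0));
  try {
    // fatal: true makes decode() throw on invalid UTF-8 instead of emitting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return str;
  }
}

console.log(decodeIfMojibake("\u00f0\u009f\u0097\u00bd"));
console.log(decodeIfMojibake("café"));
```

It could be called in place of the `Uint8Array` and `TextDecoder` lines inside `decodeUTF8`. It is still a heuristic: a string that happens to form valid UTF-8 when read as bytes would be rewritten.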
mdker