I have an XML file encoded in UTF-16, and I would like to convert it to UTF-8 in order to process it. If I use this command:

iconv -f UTF-16 -t UTF-8 file.xml > converted_file.xml

The file is converted correctly and I'm able to process it. I want to do the same in Node.js.

Currently I have a buffer of my file. I've tried everything I could think of and everything I could find on the internet, without success.

Here are some examples of what I've tried so far:

content = new Buffer((new Buffer(content, 'ucs2')).toString('utf8'));

I've also tried using these functions:

http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/
https://stackoverflow.com/a/14601808/1405208

The first attempt doesn't change anything, and the functions from the links only give me Chinese characters.

Julien Fouilhé

2 Answers

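const fs = require('fs');
// Read the file as UTF-16LE ('ucs2' is Node's alias for it), then write it back as UTF-8.
var content = fs.readFileSync('myfile.xml', {encoding:'ucs2'});
fs.writeFileSync('myfile.xml', content, {encoding:'utf8'});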
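Since the question mentions already having the file in a buffer, here is a minimal sketch of the same conversion done in memory. It assumes the buffer really is UTF-16LE; `utf16leToUtf8` is a hypothetical helper name, not something from the answer above. It also drops a leading byte order mark, which `toString('utf16le')` would otherwise keep as the character U+FEFF.

```javascript
// Hypothetical helper: convert a UTF-16LE Buffer into a UTF-8 Buffer.
// Assumption: the input is little-endian UTF-16 (what Node calls 'utf16le' / 'ucs2').
function utf16leToUtf8(buf) {
  // Decode the raw bytes into a JavaScript string.
  let text = buf.toString('utf16le');
  // Drop the byte order mark if the file started with one.
  if (text.charCodeAt(0) === 0xfeff) {
    text = text.slice(1);
  }
  // Re-encode the string as UTF-8 bytes.
  return Buffer.from(text, 'utf8');
}

// Usage with an in-memory sample instead of a real file.
const sampleUtf16 = Buffer.from('\ufeff<a>é</a>', 'utf16le');
console.log(utf16leToUtf8(sampleUtf16).toString('utf8'));
```

The earlier attempt with `new Buffer(content, 'ucs2')` goes the wrong way: it encodes a string into UTF-16 rather than decoding the UTF-16 bytes that are already there.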
Arnaud Gueras
  • What if I'm not sure my file will be utf16 encoded? – Julien Fouilhé Nov 18 '15 at 13:31
  • That's something you'll have to constrain. Either validate that it's UTF-16 or detect the actual encoding. Typically you can inspect the incoming data to determine its encoding and pass that in place of ucs2 – Ravenous Nov 18 '15 at 14:57
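One way to "figure out the type", as the comment above suggests, is to look for a byte order mark at the start of the buffer. The sketch below is only an illustration under assumptions: `detectBomEncoding` and `decodeByBom` are hypothetical names, a file without a BOM is assumed to be UTF-8, and UTF-32 is not handled (its little-endian BOM also begins with FF FE). Node has no built-in big-endian UTF-16 decoder for `Buffer.toString`, so that case swaps the bytes first.

```javascript
// Hypothetical helper: guess the encoding from a leading byte order mark.
// Returns 'utf8', 'utf16le', 'utf16be', or null when no BOM is present.
function detectBomEncoding(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) return 'utf8';
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return 'utf16le';
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return 'utf16be';
  return null;
}

// Hypothetical helper: decode a Buffer to a string based on its BOM.
function decodeByBom(buf) {
  const encoding = detectBomEncoding(buf);
  if (encoding === 'utf16le') return buf.subarray(2).toString('utf16le');
  if (encoding === 'utf16be') {
    // Copy before swapping so the caller's buffer is left untouched.
    // swap16() throws if the byte length is not a multiple of 2.
    return Buffer.from(buf.subarray(2)).swap16().toString('utf16le');
  }
  if (encoding === 'utf8') return buf.subarray(3).toString('utf8');
  // Assumption: no BOM means the data is already UTF-8.
  return buf.toString('utf8');
}
```

The decoded string can then be written out with `fs.writeFileSync(path, text, {encoding: 'utf8'})`, as in the answer.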

While the answer above mine is the best answer for the question asked, I'm hoping that this one will help folks who need to read a file as a binary string:

const reader = new FileReader();
reader.readAsBinaryString(this.fileToImport);

In my case the file was in utf-16 and I tried to read it into XLSX:

const wb = XLSX.read(bstr, { type: "binary" });

Combining both links from above, I first removed the first two bytes, which signal that the file is UTF-16LE (the byte order mark, 0xFF 0xFE), then used this link to build the right character code from each pair of bytes (though I think it actually produces UTF-7 encoding): https://stackoverflow.com/a/14601808/1405208

Lastly, I applied the second link to get the right sequence of UTF-8 bytes: https://stackoverflow.com/a/14601808/1405208

The code I ended up with:

decodeUTF16LE(binaryStr) {
      // Only convert input that starts with the UTF-16LE byte order mark (0xFF 0xFE).
      if (binaryStr.charCodeAt(0) !== 0xff || binaryStr.charCodeAt(1) !== 0xfe) {
        return binaryStr;
      }
      const utf8 = [];
      // Skip the 2-byte BOM, then read one little-endian 16-bit code unit per step.
      for (let i = 2; i < binaryStr.length; i += 2) {
        let charcode = binaryStr.charCodeAt(i) | (binaryStr.charCodeAt(i + 1) << 8);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
          utf8.push(0xc0 | (charcode >> 6), 0x80 | (charcode & 0x3f));
        } else if (charcode < 0xd800 || charcode >= 0xe000) {
          utf8.push(0xe0 | (charcode >> 12), 0x80 | ((charcode >> 6) & 0x3f), 0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
          // Advance to the low surrogate, which occupies the next two bytes.
          i += 2;
          const low = binaryStr.charCodeAt(i) | (binaryStr.charCodeAt(i + 1) << 8);
          // UTF-16 encodes 0x10000-0x10FFFF by
          // subtracting 0x10000 and splitting the
          // 20 bits of 0x0-0xFFFFF into two halves
          charcode = 0x10000 + (((charcode & 0x3ff) << 10) | (low & 0x3ff));
          utf8.push(
            0xf0 | (charcode >> 18),
            0x80 | ((charcode >> 12) & 0x3f),
            0x80 | ((charcode >> 6) & 0x3f),
            0x80 | (charcode & 0x3f)
          );
        }
      }
      // Note: apply() with a very large array can exceed the engine's argument limit.
      return String.fromCharCode.apply(String, utf8);
},
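For comparison, the same conversion can probably be done without hand-written bit manipulation where `TextDecoder` and `TextEncoder` are available (modern browsers and recent Node.js versions). This is a different technique from the function above, shown as a sketch under assumptions: `utf16BinaryStringToUtf8` is a hypothetical name, and unlike `decodeUTF16LE` it always treats the input as UTF-16LE rather than returning it unchanged when the BOM is missing.

```javascript
// Hypothetical helper: turn a binary string holding UTF-16LE bytes into a
// binary string holding UTF-8 bytes, using the platform's text codecs.
function utf16BinaryStringToUtf8(binaryStr) {
  // Each character of a binary string stands for one byte (0-255).
  const bytes = Uint8Array.from(binaryStr, (ch) => ch.charCodeAt(0) & 0xff);
  // Decode the bytes as UTF-16LE; by default TextDecoder drops a leading BOM.
  const text = new TextDecoder('utf-16le').decode(bytes);
  // Encode the decoded text as UTF-8 bytes.
  const utf8Bytes = new TextEncoder().encode(text);
  // Build the result one byte at a time to avoid apply() argument limits.
  let out = '';
  for (const b of utf8Bytes) {
    out += String.fromCharCode(b);
  }
  return out;
}
```

The result can be passed to `XLSX.read(..., { type: "binary" })` in the same way as the output of the function above.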
NatanS