
I have some data represented as an array of integers that can contain up to 200 000 elements. Each integer value can range from 0 to 200 000.

To emulate this data (for debugging purposes) I can do the following:

let data = [];
let len = 200000;
for (let i = 0; i < len; i++) {
    data[i] = i;
}

To convert this array of integers to a Unicode string I do this:

let dataAsText = data.map((e) => {
    return String.fromCodePoint(e);
}).join('');

When I convert it back to an array of integers, the result is longer than the original:

let dataBack = dataAsText.split('').map((e) => {
    return e.codePointAt(0);
});
console.log(dataBack.length);

How does this happen? What is wrong?

Extra information:

  • I use codePointAt/fromCodePoint because they can deal with all Unicode values (up to 21 bits), while charCodeAt/fromCharCode cannot.

  • Using, for example, .join('123') and .split('123') makes dataBack the same length as data. But this isn't an elegant solution, because the string dataAsText becomes unnecessarily large.

  • If len is less than or equal to 65536 (which is 2^16), everything works fine. Isn't that strange?
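To narrow it down, the mismatch can be reproduced with a single value above 0xFFFF:

```javascript
// Minimal reproduction: one code point above 0xFFFF.
let sample = String.fromCodePoint(150000);
console.log(sample.length);           // 2: stored as two UTF-16 code units
console.log(sample.split('').length); // 2: split('') separates the two units
```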

EDIT:

I use codePoint because I need to convert the data to Unicode text so that the result stays short.

More about codePoint vs charCode with an example: If we convert 150000 to a character then back to an integer with codePoint:

console.log(String.fromCodePoint(150000).codePointAt(0));

this gives us 150000 which is correct. Doing the same with charCode fails and prints 18928 (and not 150000):

console.log(String.fromCharCode(150000).charCodeAt(0));
georg
dll
4 Answers


That's because code points above 0xFFFF are encoded as two UTF-16 code units (a surrogate pair), as can be seen in this snippet:

var s = String.fromCodePoint(0x2F804)
console.log(s);  // Shows one character
console.log('length = ', s.length); // 2, because encoding is \uD87E\uDC04

var i = s.codePointAt(0);
console.log('CodePoint value at 0: ', i); // correct

var i = s.codePointAt(1); // Should not do this, it starts in the middle of a sequence!
console.log('CodePoint value at 1: ', i); // misleading

In your code things go wrong when you do split: it splits the string into individual UTF-16 code units, discarding the fact that some pairs are intended to combine into a single character.

You can use the ES6 spread syntax, which takes surrogate pairs into account:

let dataBack = [...dataAsText].map((e) => {
    return e.codePointAt(0);
});

Now your counts will be the same.

Example:

// (Only 20 instead of 200000)
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.map(e => String.fromCodePoint(e)).join("");

console.log("String length: " + dataAsText.length);

let dataBack = [...dataAsText].map(e => e.codePointAt(0));

console.log(dataBack);

Surrogates

Be aware that within the range 0 ... 65535 there is a range (0xD800–0xDFFF) reserved for so-called surrogates, which only represent a character when combined with another value. You should not iterate over those values expecting each one to represent a character on its own. So in your original code, this is another source of error.

To fix this, you should really skip over those values:

for (let i = 0; i < len; i++) {
    if (i < 0xd800 || i > 0xdfff) data.push(i);
}

In fact, there are many other code points that do not represent a character.

trincot
  • Using spread is brilliant! But note that spread is not an operator, it's just syntax. (Operators have to have result values.) – T.J. Crowder Dec 11 '16 at 12:01
  • `[...dataAsText]` won't help either - consider `data=[0xdbff, 0xdc00]`. – georg Dec 11 '16 at 12:15
  • @georg, I don't see a problem with that sequence: `[...'\udbff\udc00'].length` returns 1 for me. Is it different for you? – trincot Dec 11 '16 at 12:25
  • @trincot: with these values, `dataBack` will be != data – georg Dec 11 '16 at 12:27
  • Indeed, that is because you start out with surrogates as if they represent individual characters. I added a section on that in my answer. Does that cover it? – trincot Dec 11 '16 at 12:45

I don't think you want codePointAt (or charCodeAt) at all. To convert a number to a string, just use String; to have a single delimited string with all the values, use a delimiter (like ,); to convert back to a number, use the appropriate one of Number, the unary +, parseInt, or parseFloat (in your case, Number or + probably):

// Only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.join(",");

console.log(dataAsText);

let dataBack = dataAsText.split(",").map(Number);

console.log(dataBack);

If your goal with codePointAt is to keep the dataAsText string short, then you can do that, but you can't use split to recreate the array because JavaScript strings are UTF-16 (effectively) and split("") will split at each 16-bit code unit rather than keeping code points together.

A delimiter would help there too:

// Again, only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.map(e => String.fromCodePoint(e)).join(",");

console.log("String length: " + dataAsText.length);

let dataBack = dataAsText.split(",").map(e => e.codePointAt(0));

console.log(dataBack);
T.J. Crowder

I have a feeling split doesn't work with Unicode values above 65535; a quick test shows that such strings double in length after splitting.
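For example (65536 is the first code point that needs two UTF-16 code units):

```javascript
// 65535 (0xFFFF) still fits in one code unit; 65536 (0x10000) needs two.
console.log(String.fromCodePoint(65535).split('').length); // 1
console.log(String.fromCodePoint(65536).split('').length); // 2
```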

Perhaps look at this post and answers, as they ask a similar question

Joseph Young

If you're looking for a way to encode a list of integers so that you can safely transmit it over a network, node Buffers with base64 encoding might be a better option:

let data = [];
for (let i = 0; i < 200000; i++) {
    data.push(i);
}

// encoding

var ta = new Int32Array(data);
var buf = Buffer.from(ta.buffer);
var encoded = buf.toString('base64');

// decoding

var buf = Buffer.from(encoded, 'base64');
var ta = new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2);
var decoded = Array.from(ta);

// same?

console.log(decoded.join() == data.join())

Your original approach won't work because not every integer has a corresponding code point in Unicode.
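A quick sketch of that limit: anything above 0x10FFFF is rejected outright, and the surrogate range produces lone surrogates rather than real characters:

```javascript
// Values above 0x10FFFF are not code points at all:
try {
    String.fromCodePoint(0x110000);
} catch (e) {
    console.log(e.name); // RangeError
}
// 0xD800-0xDFFF are reserved for surrogates; this yields a lone
// surrogate, which is not a valid character on its own:
console.log(String.fromCodePoint(0xD800).length); // 1
```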

UPD: if you don't need the data to be binary-safe, no need for base64, just store the buffer as is:

// saving

var ta = new Int32Array(data);
fs.writeFileSync('whatever', Buffer.from(ta.buffer));

// loading

var buf = fs.readFileSync('whatever');
var loadedData = Array.from(new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2));

// same?

console.log(loadedData.join() == data.join())
georg
  • This is a very nice alternative! But saving the encoded data results in an abnormally large file. I save the encoded file by doing: ```fs.writeFileSync('compressed.txt', encoded, 'base64');``` Is this the right way? – dll Dec 11 '16 at 17:27