Javascript: unicode character to BYTE based hex escape sequence (NOT surrogates)

Question

In JavaScript I am trying to convert Unicode characters into byte-based hex escape sequences that are compatible with C:

i.e. 😄

becomes: \xF0\x9F\x98\x84 (correct)

NOT JavaScript surrogates, i.e. not \uD83D\uDE04 (wrong)

I cannot figure out the mathematical relationship between the four bytes C wants and the two surrogates JavaScript uses. I suspect the algorithm is far more complex than my feeble attempts.

Thanks for any tips.

Bergi · Answer 1 · 2015-08-12T10:36:24.077

Your C code expects a UTF-8 string (the symbol is represented as 4 bytes). The JS representation you see, however, is UTF-16 (the symbol is represented as 2 uint16s, a surrogate pair).
You will first need to get the (Unicode) code point for your symbol (from the UTF-16 JS string), then build the UTF-8 representation from that code point.
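
To illustrate the first step, here is a minimal sketch of the arithmetic that turns a surrogate pair into a code point. The helper name `surrogatesToCodePoint` is made up for this example, and it assumes it is given a valid high/low pair:

```javascript
// Combine a UTF-16 surrogate pair into one Unicode code point.
// `surrogatesToCodePoint` is a hypothetical helper name, not a built-in.
function surrogatesToCodePoint(high, low) {
    // Each surrogate carries 10 payload bits; together they encode
    // (code point - 0x10000), so the offset is added back at the end.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var smileyCodePoint = surrogatesToCodePoint(0xD83D, 0xDE04);
console.log(smileyCodePoint.toString(16)); // "1f604"
```

This is the calculation `codePointAt` performs internally for a well-formed pair.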

Since ES6 you can use the codePointAt method for the first part; I would recommend using it through a shim where it is not natively supported. I guess you don't want to decode surrogate pairs yourself :-)
For the rest, I don't think there's a library method, but you can write it yourself according to the UTF-8 spec:

function hex(x) {
    x = x.toString(16);
    // pad to \uXXXX for 16-bit code units, \xXX for bytes
    return (x.length > 2 ? "\\u0000" : "\\x00").slice(0,-x.length)+x.toUpperCase();
}
var c = "😄"; // U+1F604
console.log(c.length, hex(c.charCodeAt(0))+hex(c.charCodeAt(1))); // 2, "\uD83D\uDE04"
var cp = c.codePointAt(0);
// 4-byte UTF-8 form; only valid for code points >= 0x10000
var bytes = new Uint8Array(4);
bytes[3] = 0x80 | cp & 0x3F;
bytes[2] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[1] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[0] = 0xF0 | (cp >>>= 6) & 0x07; // lead byte carries only 3 payload bits
console.log(Array.prototype.map.call(bytes, hex).join("")) // "\xF0\x9F\x98\x84"

(tested in Chrome)
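
The snippet above only covers code points that need four bytes. A rough generalization to all four UTF-8 lengths might look like the following; `utf8Bytes` and `toCEscapes` are hypothetical names, and the sketch assumes well-formed input (a lone surrogate would produce bytes that are not valid UTF-8):

```javascript
// Encode one code point (0..0x10FFFF) as an array of UTF-8 byte values.
// `utf8Bytes` and `toCEscapes` are hypothetical helper names.
function utf8Bytes(cp) {
    if (cp < 0x80) return [cp];
    if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
    if (cp < 0x10000) {
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
    }
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Turn a whole JS string into \xHH escapes, one per UTF-8 byte.
function toCEscapes(str) {
    var out = "";
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp > 0xFFFF) i++; // the code point used two code units; skip the low surrogate
        utf8Bytes(cp).forEach(function (b) {
            out += "\\x" + (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
        });
    }
    return out;
}

console.log(toCEscapes("\uD83D\uDE04")); // \xF0\x9F\x98\x84
```

The branch taken depends only on how many payload bits the code point needs: 7, 11, 16 or 21.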

score 1 · Answer 2 · answered Aug 01 '15 at 13:09

Found a solution here: http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/

I would have never figured out THAT math, wow.

Somewhat minified, with each byte padded to two uppercase hex digits:

function UTF8seq(s) {
    var i, c, u = [];
    for (i = 0; i < s.length; i++) {
        c = s.charCodeAt(i);
        if (c < 0x80) { u.push(c); }
        else if (c < 0x800) { u.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f)); }
        else if (c < 0xd800 || c >= 0xe000) { u.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f)); }
        else {
            // surrogate pair: combine both code units into one code point
            i++;
            c = 0x10000 + (((c & 0x3ff) << 10) | (s.charCodeAt(i) & 0x3ff));
            u.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
        }
    }
    // prefix each byte individually so an empty string yields "" rather than "\x"
    for (i = 0; i < u.length; i++) { u[i] = '\\x' + (u[i] < 16 ? '0' : '') + u[i].toString(16).toUpperCase(); }
    return u.join('');
}

Artem · Accepted Answer · 2015-08-04T00:17:14.563

encodeURIComponent does this work for you, because it percent-encodes the UTF-8 bytes of each character:

var input = "\uD83D\uDE04";
var result = encodeURIComponent(input).replace(/%/g, "\\x"); // \xF0\x9F\x98\x84
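
A quick check of how this behaves on mixed input may help. The wrapper name `escapeViaUri` is invented for this sketch; note that `encodeURIComponent` leaves letters, digits and a few punctuation characters untouched, and throws a `URIError` on a lone surrogate:

```javascript
// `escapeViaUri` is a hypothetical wrapper around the one-liner above.
function escapeViaUri(s) {
    return encodeURIComponent(s).replace(/%/g, "\\x");
}

console.log(escapeViaUri("\uD83D\uDE04"));       // \xF0\x9F\x98\x84
console.log(escapeViaUri("a1\uD83D\uDE04"));     // a1\xF0\x9F\x98\x84
console.log(escapeViaUri("\uD83D\uDE04" + "a")); // \xF0\x9F\x98\x84a
```

The last case deserves care on the C side: a C hex escape consumes every hex digit that follows it, so `\x84a` inside one string literal would be read as a single (out-of-range) escape rather than the byte 0x84 followed by `a`.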

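Update: Actually, C strings can contain digits and letters without escaping, but if you really need to escape them:

function escape(s, escapeEverything) {
    if (escapeEverything) {
        // Pre-escape ASCII that encodeURIComponent would leave alone, using "-" as a
        // temporary stand-in for "\" ("-", "x" and hex digits pass through unchanged).
        s = s.replace(/[\x10-\x7f]/g, function (ch) {
            return "-x" + ch.charCodeAt(0).toString(16).toUpperCase();
        });
    }
    // Everything else (below 0x10 or above 0x7f) becomes %XX, then \xXX
    s = encodeURIComponent(s).replace(/%/g, "\\x");
    if (escapeEverything) {
        s = s.replace(/\-/g, "\\");
    }
    return s;
}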
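
Where a global `TextEncoder` is available (an assumption; it is not part of the answer above), escaping every byte can be sketched without the placeholder trick. `escapeAllBytes` is a hypothetical name:

```javascript
// `escapeAllBytes` is a hypothetical helper; it assumes a global TextEncoder.
function escapeAllBytes(s) {
    var bytes = new TextEncoder().encode(s); // Uint8Array of UTF-8 bytes
    return Array.prototype.map.call(bytes, function (b) {
        // always emit exactly two uppercase hex digits per byte
        return "\\x" + (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
    }).join("");
}

console.log(escapeAllBytes("1\uD83D\uDE04")); // \x31\xF0\x9F\x98\x84
```

Unlike `encodeURIComponent`, `TextEncoder` does not throw on a lone surrogate; it substitutes U+FFFD instead, which may or may not be what you want.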

edited Aug 04 '15 at 00:17

answered Aug 01 '15 at 13:20


This is genius. Very clever use of existing javascript function. Will have to test it for compatibility with all scenarios/sequences. – ck_ Aug 01 '15 at 13:21

I found one exception `encodeURIComponent` cannot handle. It refuses to encode the digits 0-9 (for encapsulated emoji). So I have to pre-process before with `replace(/[0-9]/g,function(y){return y.charCodeAt(0).toString(16)})` – ck_ Aug 01 '15 at 17:33
@ck_ Not just the digits `0-9` — e.g. `a-zA-Z` won’t be escaped either. [Use utf8.js](https://mths.be/utf8js) (to encode as UTF-8) combined with [jsesc](https://mths.be/jsesc) with [`escapeEverything: true`](https://github.com/mathiasbynens/jsesc#escapeeverything) (to escape the octets) for a proper solution. – Mathias Bynens Aug 03 '15 at 08:28
