In JavaScript, I am trying to convert Unicode characters into byte-based hex escape sequences that are compatible with C:

i.e. 😄

becomes: \xF0\x9F\x98\x84 (correct)

NOT the JavaScript surrogates \uD83D\uDE04 (wrong)

I cannot figure out the mathematical relationship between the four bytes C wants and the two surrogates JavaScript uses. I suspect the algorithm is far more complex than my feeble attempts.

Thanks for any tips.

ck_

3 Answers

Your C code expects a UTF-8 string (the symbol is represented as 4 bytes). The JS representation you see, however, is UTF-16 (the symbol is represented as 2 uint16s, a surrogate pair).
You will first need to get the (Unicode) code point for your symbol (from the UTF-16 JS string), then build the UTF-8 representation for it from that.

Since ES6 you can use the codePointAt method for the first part; I would recommend using it, with a shim in environments that don't support it natively. I guess you don't want to decode surrogate pairs yourself :-)
For the rest, I don't think there's a library method, but you can write it yourself according to the UTF-8 spec:

// Format a number as a hex escape: \xNN for bytes, \uNNNN for UTF-16 code units
function hex(x) {
    x = x.toString(16);
    return (x.length > 2 ? "\\u0000" : "\\x00").slice(0,-x.length)+x.toUpperCase();
}
var c = "\uD83D\uDE04"; // 😄 (U+1F604)
console.log(c.length, hex(c.charCodeAt(0))+hex(c.charCodeAt(1))); // 2, "\uD83D\uDE04"
var cp = c.codePointAt(0); // 0x1F604
// Build the 4-byte UTF-8 form (only valid for code points >= 0x10000)
var bytes = new Uint8Array(4);
bytes[3] = 0x80 | cp & 0x3F;
bytes[2] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[1] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[0] = 0xF0 | (cp >>>= 6) & 0x07; // the lead byte carries only 3 payload bits
console.log(Array.prototype.map.call(bytes, hex).join("")) // "\xF0\x9F\x98\x84"

(tested in Chrome)
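
The snippet above covers only the four-byte case. A minimal sketch of how it might be generalized to every code point range follows. The helper names utf8Bytes and toCHexEscapes are hypothetical (they are not part of the answer), and the input is assumed to be a well-formed string with no unpaired surrogates:

```javascript
// Hypothetical helper: UTF-8 bytes for a single Unicode code point.
function utf8Bytes(cp) {
    if (cp < 0x80) return [cp];                                  // 1 byte (ASCII)
    if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 2 bytes
    if (cp < 0x10000) {                                          // 3 bytes
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
    }
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),       // 4 bytes
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Hypothetical helper: escape a whole string as \xNN sequences.
// Assumes well-formed input (no lone surrogates).
function toCHexEscapes(str) {
    var out = "";
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp > 0xFFFF) i++; // the code point used two UTF-16 units; skip the low surrogate
        utf8Bytes(cp).forEach(function (b) {
            out += "\\x" + (b < 0x10 ? "0" : "") + b.toString(16).toUpperCase();
        });
    }
    return out;
}

console.log(toCHexEscapes("\uD83D\uDE04")); // \xF0\x9F\x98\x84
```

Note that this escapes every character, including plain ASCII, which may be more than a C string literal strictly needs.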

Bergi

Found a solution here: http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/

I would have never figured out THAT math, wow.

Somewhat minified:

function UTF8seq(s) {
    var i, c, u = [];
    for (i = 0; i < s.length; i++) {
        c = s.charCodeAt(i);
        if (c < 0x80) { u.push(c); }                                        // 1 byte (ASCII)
        else if (c < 0x800) { u.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f)); } // 2 bytes
        else if (c < 0xd800 || c >= 0xe000) {                               // 3 bytes
            u.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
        } else { // surrogate pair: combine both halves into one code point, then 4 bytes
            i++;
            c = 0x10000 + (((c & 0x3ff) << 10) | (s.charCodeAt(i) & 0x3ff));
            u.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
        }
    }
    // Two uppercase hex digits per byte (e.g. \x0A rather than \xa)
    for (i = 0; i < u.length; i++) { u[i] = '\\x' + (u[i] < 0x10 ? '0' : '') + u[i].toString(16).toUpperCase(); }
    return u.join('');
}
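
For anyone puzzled by the surrogate branch, the arithmetic can be checked by hand on the example from the question. This is only an illustrative sketch; the function name surrogatePairToCodePoint is made up for this example:

```javascript
// Hypothetical helper: combine a high and a low surrogate into one code point.
// Each surrogate carries 10 payload bits; 0x10000 is added because
// surrogate pairs only encode code points above the 16-bit range.
function surrogatePairToCodePoint(hi, lo) {
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

// 0xD83D - 0xD800 = 0x3D, shifted left by 10 bits = 0xF400
// 0xDE04 - 0xDC00 = 0x204
// 0x10000 + 0xF400 + 0x204 = 0x1F604
console.log(surrogatePairToCodePoint(0xD83D, 0xDE04).toString(16)); // 1f604
```

The resulting code point, 0x1F604, is then split into groups of 3 + 6 + 6 + 6 bits, which become the payload of the four UTF-8 bytes F0 9F 98 84.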
ck_

encodeURIComponent does this work for you, because it percent-encodes the UTF-8 bytes of the string:

var input = "\uD83D\uDE04";
var result = encodeURIComponent(input).replace(/%/g, "\\x"); // \xF0\x9F\x98\x84

Update: Actually, C strings can contain digits and letters without escaping, but if you really need to escape them too:

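function escape(s, escapeEverything) {
    if (escapeEverything) {
        // Pre-escape ASCII with a "-x" placeholder; encodeURIComponent leaves "-", "x" and hex digits alone
        s = s.replace(/[\x10-\x7f]/g, function (s) {
            return "-x" + s.charCodeAt(0).toString(16).toUpperCase();
        });
    }
    // Percent-encode the remaining characters as UTF-8 bytes, then turn %NN into \xNN
    s = encodeURIComponent(s).replace(/%/g, "\\x");
    if (escapeEverything) {
        // Turn the "-" placeholders into backslashes
        s = s.replace(/\-/g, "\\");
    }
    return s;
}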
Artem
  • This is genius. Very clever use of existing javascript function. Will have to test it for compatibility with all scenarios/sequences. – ck_ Aug 01 '15 at 13:21
  • I found one exception `encodeURIComponent` cannot handle. It refuses to encode the digits 0-9 (for encapsulated emoji). So I have to pre-process before with `replace(/[0-9]/g,function(y){return y.charCodeAt(0).toString(16)})` – ck_ Aug 01 '15 at 17:33
  • @ck_ Not just the digits `0-9` — e.g. `a-zA-Z` won’t be escaped either. [Use utf8.js](https://mths.be/utf8js) (to encode as UTF-8) combined with [jsesc](https://mths.be/jsesc) with [`escapeEverything: true`](https://github.com/mathiasbynens/jsesc#escapeeverything) (to escape the octets) for a proper solution. – Mathias Bynens Aug 03 '15 at 08:28
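
On the point raised in these comments, that `encodeURIComponent` leaves digits and letters unescaped, one alternative is to skip it and escape every UTF-8 byte directly. The sketch below swaps in the standard `TextEncoder` API (available in current browsers and Node.js) instead of the libraries linked above. The function name escapeAllBytes is hypothetical, and `TextEncoder` replaces unpaired surrogates with U+FFFD, so well-formed input is assumed:

```javascript
// Hypothetical helper: escape every UTF-8 byte of a string as \xNN.
function escapeAllBytes(s) {
    var bytes = new TextEncoder().encode(s); // Uint8Array of UTF-8 bytes
    return Array.from(bytes, function (b) {
        return "\\x" + (b < 0x10 ? "0" : "") + b.toString(16).toUpperCase();
    }).join("");
}

console.log(escapeAllBytes("1\uD83D\uDE04")); // \x31\xF0\x9F\x98\x84
```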