Unicode surrogate pairs and String.fromCodePoint() — JavaScript

Question

I'm dealing with raw strings containing escape sequences for surrogate halves of UTF astral symbols. (I think I got that lingo right…)

console.log("\uD83D\uDCA9")
// => 💩

Let's use the above emoji as an example. If I have the surrogate pair (\uD83D\uDCA9), how can I take its hexadecimal values and turn them into a valid argument for JavaScript's String.fromCodePoint() function?

I've tried the following:

const codePoint = ["D83D", "DCA9"].reduce((acc, cur) => {
    return acc += parseInt(cur, 16);
}, 0);

console.log(String.fromCodePoint(codePoint));
// => (some weird symbol appears, not 💩!)

PS: I'm familiar with ES6 escape sequences which show hexadecimal values between brackets {…} instead of using surrogate halves. But I need to do this with surrogate pairs!

Any suggestions are greatly appreciated.

Pointy · Accepted Answer · 2018-12-21T13:41:16.187

You can pass a list of values to the function:

console.log(String.fromCodePoint(0xd83d, 0xdca9));

Thus a "valid argument" for String.fromCodePoint() is not necessarily a single value, and indeed for a character that requires a surrogate pair it by definition cannot be a single value. Why? Because each individual numeric source value, as far as String.fromCodePoint() is concerned, must be a 16-bit (2-byte) value. If you could pass bigger single numbers, there would be no need for surrogate pairs!

Edit: much of the above paragraph is inaccurate; the .fromCodePoint() method will accept full Unicode code point values (wider than 16 bits). It still has to split them into surrogate pairs because JavaScript strings are UTF-16, but this means that if you happen to have full-size Unicode code points, you don't have to split them up yourself, which is nice. However, if you already have pairs, there's really no point combining them yourself, because the method also works on the pairs when they are passed as part of a list of code points.
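As a quick check of the point made in the edit, here is a minimal sketch comparing the two calling styles. The variable names are illustrative only and not part of the original answer.

```javascript
// Build the same emoji from a surrogate pair and from the full code point.
const fromPair = String.fromCodePoint(0xd83d, 0xdca9);
const fromFull = String.fromCodePoint(0x1f4a9);

console.log(fromPair === fromFull); // true
console.log(fromFull.length);       // 2 (two UTF-16 code units)
```

Both calls should yield the same two-code-unit string, so either form works depending on what data you start with.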

If you have values in an array, you can invoke the function with apply:

var points = [0xd83d, 0xdca9];
console.log(String.fromCodePoint.apply(String, points));
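In environments that support ES2015 spread syntax, the same call can probably be written without apply. The sketch below starts from hex strings like those in the question; hexUnits, codeUnits, and emoji are hypothetical names chosen for illustration.

```javascript
// Hex strings as in the question; parse each one as base 16.
const hexUnits = ["D83D", "DCA9"];
const codeUnits = hexUnits.map((hex) => parseInt(hex, 16));

// Spread syntax passes each array element as a separate argument.
const emoji = String.fromCodePoint(...codeUnits);
console.log(emoji === "\uD83D\uDCA9"); // true
```

The arrow function around parseInt is deliberate: passing parseInt directly to map would hand it the array index as the radix.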

That's a really good answer. Thank you! `apply()` even works with emoji sequences that consist of multiple emoji, e.g. this one 👩‍❤️‍👩 (\uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC69). I'm still a little puzzled by the [MDN documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint), which shows this as an example: `String.fromCodePoint(0x1D306, 0x61, 0x1D307) // "\uD834\uDF06a\uD834\uDF07"` I don't see how the arguments and escape sequence correlate at all. — Audun Olsen, Dec 21 '18 at 08:15
@AudunOlsen that is interesting and somewhat surprising; I'll play around with that. *edit* ah, interesting. According to the spec, and my Firefox, the function *will* accept full 32-bit values (up to the largest actual code point). I still think for your purposes working with the pairs explicitly is probably the easiest thing to do, rather than trying to combine the pairs into a single value only to have `.fromCodePoint()` split them up again. I'll update the answer. — Pointy, Dec 21 '18 at 13:35
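To see how the arguments in the MDN example relate to the escape sequences, one can split each astral code point into its surrogate halves by hand. This is a rough sketch; toSurrogatePair and hex are hypothetical helper names, not built-in functions. The middle argument, 0x61, is below 0x10000 and simply stays the single code unit "a".

```javascript
// Hypothetical helper: split an astral code point (above 0xFFFF) into UTF-16 surrogates.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point between 0x10000 and 0x10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xd800 + (offset >> 10);  // upper 10 bits go into the high surrogate
  const low = 0xdc00 + (offset & 0x3ff); // lower 10 bits go into the low surrogate
  return [high, low];
}

const hex = (n) => n.toString(16).toUpperCase();
console.log(toSurrogatePair(0x1d306).map(hex).join(" ")); // D834 DF06
console.log(toSurrogatePair(0x1d307).map(hex).join(" ")); // D834 DF07
```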

Mr Lister · Answer 2 · 2018-12-21T12:41:28.547


The solution by Pointy is correct, but to answer your question about what goes wrong with your formula: the problem is that you simply add 0xD83D and 0xDCA9, which results in 0x1B4E6. That is not how surrogates work; you should have used the proper formula

((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000

which can be shortened to

((first - 0xD7F7) << 10) + second

See Unicode encodings.

If you do that, you'll get 0x1F4A9.

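const codePoint = ["D83D", "DCA9"].reduce((acc, cur) => {
  const unit = parseInt(cur, 16);
  // High surrogates (below 0xDC00) are rebased and shifted; low surrogates are added as-is.
  return acc + (unit < 0xDC00 ? (unit - 0xD7F7) << 10 : unit);
}, 0);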

console.log(String.fromCodePoint(codePoint));
// => now outputs 💩!
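For input that might not be a well-formed pair, a version with range checks may be safer. The following is a sketch using the unshortened formula above; surrogatePairToCodePoint is a hypothetical name, and the cross-check relies on the built-in String.prototype.codePointAt().

```javascript
// Hypothetical helper using the unshortened formula, with range checks.
function surrogatePairToCodePoint(high, low) {
  if (high < 0xd800 || high > 0xdbff) throw new RangeError("not a high surrogate");
  if (low < 0xdc00 || low > 0xdfff) throw new RangeError("not a low surrogate");
  return ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
}

const combined = surrogatePairToCodePoint(0xd83d, 0xdca9);
console.log(combined.toString(16));                      // 1f4a9
console.log("\uD83D\uDCA9".codePointAt(0) === combined); // true
```

If the pair already lives in a string, codePointAt(0) alone gives the combined code point without any manual arithmetic.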

edited Dec 21 '18 at 12:41

answered Dec 21 '18 at 09:25


Very insightful, thanks! Though I now understand that I don't grok hexadecimals. There are two new foreign numbers (to me at least) in your equation: 0xD7F7 and 0x400. `parseInt(0xD7F7, 10) // => 55287` and `parseInt(0x400, 10) // => 1024`; how do these numbers relate? I fail to see the pattern. – Audun Olsen Dec 21 '18 at 10:10

Yes, sorry, I should have explained that part better. The first word contains bits 10 to 19 of the code point (after 0x10000 has been subtracted from it) in its lower 10 bits, and the second word contains the lowest 10 bits, so you have to mask and shift things around. The 0xD7F7 is just what you get if you combine all the additions and subtractions needed. See also the link in my answer. – Mr Lister Dec 21 '18 at 12:46
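The masking and shifting described in this comment can be sketched as follows, along with one way to see where 0xD7F7 comes from. The names high, low, masked, and folded are illustrative only.

```javascript
const high = 0xd83d;
const low = 0xdca9;

// Masking with 0x3FF keeps the 10 payload bits of each surrogate.
const masked = (((high & 0x3ff) << 10) | (low & 0x3ff)) + 0x10000;

// Where 0xD7F7 comes from: 0x10000 - 0xDC00 = 0x2400, which is 9 << 10,
// so the constant folds into the shifted term as 0xD800 - 9.
const folded = ((high - (0xd800 - 9)) << 10) + low;

console.log((0xd800 - 9).toString(16));              // d7f7
console.log(masked === 0x1f4a9, folded === 0x1f4a9); // true true
```

The 1024 (0x400) asked about in the preceding comment is simply 1 << 10, the multiplier that a 10-bit left shift applies.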

@AudunOlsen it has nothing to do with hexadecimals. The mistake you made can be shown with decimals as well: let's say the surrogate pair is 12 and 34 (decimal!). What you did is 12 + 34 = 46. The correct code point, though, is 1234, so the formula would be (12 * 10^2) + 34 = 1234. The 10^2 part is the shift operation <<. Shifting by one just multiplies by the base (2 in binary, 10 in decimal): in binary, 10 << 1 = 100. – jan.vogt Feb 03 '19 at 15:05
