
I am basing this on this article: https://kishuagarwal.github.io/unicode.html


As an example, I took the Unicode code point 0x1F9F0 and tried to encode it in UTF-16.

In hex:

0x1F9F0

In binary:

0001 1111 1001 1111 0000

Following the explanation from the article, I should have something like this:

1101 10XX XXXX XXXX 1101 11XX XXXX XXXX

Populating the free bits with the bits of the code point gives me

binary:

1101 1000 0111 1110 1101 1101 1111 0000

hex:

\uD87E \uDDF0

But according to this page, the correct value is:

hex:

\uD83E\uDDF0

binary:

1101 1000 0011 1110 1101 1101 1111 0000

So...

       my hex: \uD87E \uDDF0
  correct hex: \uD83E \uDDF0

I have a single bit misplaced, and I can't figure out why...
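For reference, this is how I cross-checked the expected result (a minimal sketch in Python, using the standard `utf-16-be` codec; not from the article):

```python
# Ask Python's built-in codec for the UTF-16 (big-endian) encoding of U+1F9F0.
encoded = "\U0001F9F0".encode("utf-16-be")
print(encoded.hex())  # d83eddf0 -> high surrogate 0xD83E, low surrogate 0xDDF0
```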


1 Answer


Converting 0x1F9F0 (0001 1111 1001 1111 0000)

From the article you posted, we follow the part:

For the unicode codepoints from U+010000 to U+10FFFF, ...

and the first step, which you probably missed:

Firstly 0x010000 is subtracted from the code point, giving us a 20-bit number in the range 0x000000 to 0x0FFFFF.

that is, 0x0F9F0 (0000 1111 1001 1111 0000)
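To make that step concrete, here is a minimal sketch in Python (the variable names are mine):

```python
code_point = 0x1F9F0
offset = code_point - 0x10000       # the subtraction step that was skipped
print(hex(offset))                  # 0xf9f0
print(format(offset, "020b"))       # 00001111100111110000 (a 20-bit value)
```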

UTF-16 divides this range into two buckets 0xD800...0xDBFF and 0xDC00...0xDFFF (let’s call them A and B), where each bucket has 10 free bits and 6 fixed bits (shown in grey in the image).

or, as you already posted: 1101 10XX XXXX XXXX and 1101 11XX XXXX XXXX

The 20-bit number that we got above after subtracting is now divided into two parts of 10 bits each. The first 10 bits are used to fill the 10 free bits of A, while the remaining 10 bits are used to fill the 10 free bits of B.

resulting in 1101 1000 0011 1110 and 1101 1101 1111 0000, or 0xD83E 0xDDF0, as expected.
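Putting all three steps together (a minimal sketch in Python; the helper name `to_surrogate_pair` is mine, not from the article, and the result is cross-checked against Python's built-in `utf-16-be` codec):

```python
def to_surrogate_pair(code_point):
    """Encode a code point in U+010000..U+10FFFF as a UTF-16 surrogate pair."""
    offset = code_point - 0x10000       # 20-bit value in 0x00000..0xFFFFF
    high = 0xD800 | (offset >> 10)      # top 10 bits fill bucket A
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits fill bucket B
    return high, low

high, low = to_surrogate_pair(0x1F9F0)
print(hex(high), hex(low))              # 0xd83e 0xddf0

# Cross-check against the built-in codec.
expected = "\U0001F9F0".encode("utf-16-be")
assert bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF]) == expected
```

This also shows exactly where your single bit went wrong: without the `- 0x10000`, the top 10 bits of 0x1F9F0 come out as 0x7E instead of 0x3E, giving 0xD87E instead of 0xD83E for the high surrogate.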

True, my mistake was that I read it wrong: `For the unicode codepoints from U+010000 to U+10FFFF`; instead I read: `For the unicode codepoints from U+010000 to U+10FFF`. – Bruno Rozendo Jul 15 '19 at 12:43