How to Convert UTF-16 Surrogate Decimal to UNICODE in C++

Question

I receive some string data as a parameter, such as &#55357;&#56842;.

This is a UTF-16 surrogate pair with each code unit written in decimal.

How can I convert such a pair to a Unicode code point such as "U+1F60A" with the standard library?

I've tried to help with the formatting of your question, thanks for fixing my fix. Feel free to continue doing so. As for the standard library, I'm afraid it won't be enough. — Quentin Pradet, Feb 22 '16 at 07:37

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

You can easily do it by hand. The algorithm for converting a code point above U+FFFF to a surrogate pair and back is not that hard. The Wikipedia page on UTF-16 says:

U+10000 to U+10FFFF

0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

That's just bitwise and, or and shift and can trivially be implemented in C or C++.
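To go in the direction the question asks for (surrogate pair to code point), the three steps quoted above are simply reversed. The following is a minimal sketch of that; `decode_surrogate_pair` is a hypothetical helper name, not something from the standard library, and it assumes the two values arrive as plain integers.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the standard library): combines a UTF-16
// high/low surrogate pair into one Unicode code point.
// Returns false, leaving `out` untouched, if either value is outside its
// surrogate range.
bool decode_surrogate_pair(std::uint32_t high, std::uint32_t low, char32_t &out) {
    if (high < 0xD800 || high > 0xDBFF) return false;  // not a high surrogate
    if (low < 0xDC00 || low > 0xDFFF) return false;    // not a low surrogate

    // Undo the encoding: strip the surrogate bases, reassemble the 20-bit
    // value from its two 10-bit halves, then add back the 0x10000 offset.
    std::uint32_t top_ten = high - 0xD800;
    std::uint32_t low_ten = low - 0xDC00;
    out = static_cast<char32_t>(0x10000 + ((top_ten << 10) | low_ten));
    return true;
}
```

With the values from the question, `decode_surrogate_pair(55357, 56842, cp)` returns true and sets `cp` to 0x1F60A.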

As you said you wanted to use the standard library: what you are asking for is a conversion from two 16-bit UTF-16 surrogates to one 32-bit Unicode code point, so codecvt is your friend, provided you can compile in C++11 mode or higher (note that the <codecvt> header was deprecated in C++17).

Here is an example processing your values on a little-endian architecture:

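#include <cstdint>
#include <cwchar>
#include <iostream>
#include <locale>
#include <codecvt>

int main() {
    // Decodes little-endian UTF-16 bytes into UTF-32 code points.
    std::codecvt_utf16<char32_t, 0x10ffffUL,
        std::codecvt_mode::little_endian> cvt;
    std::mbstate_t state = std::mbstate_t();  // must start zero-initialized

    char16_t pair[] = { 55357, 56842 };
    const char *next;   // set by in() to the end of the consumed input

    char32_t u[2];
    char32_t *unext;    // set by in() to the end of the produced output

    // in() takes its input as bytes, so view the code units as chars.
    const char *first = reinterpret_cast<const char *>(pair);
    cvt.in(state, first, first + sizeof pair, next, u, u + 1, unext);

    std::cout << std::hex << (uint16_t) pair[0] << " " << (uint16_t) pair[1]
        << std::endl;
    std::cout << std::hex << (uint32_t) u[0] << std::endl;

    return 0;
}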

Output is as expected:

d83d de0a
1f60a
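The question's input arrives as text of the form &#55357;&#56842; and the desired result is the "U+XXXX" notation, so the pieces still missing are parsing and formatting. The sketch below covers both under some assumptions: the three function names are hypothetical, the input is assumed to contain only decimal references of the form `&#N;`, and values that are not part of a valid surrogate pair are passed through unchanged.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: collects the decimal values of every "&#N;" reference.
std::vector<unsigned long> parse_decimal_refs(const std::string &text) {
    std::vector<unsigned long> values;
    std::size_t pos = 0;
    while ((pos = text.find("&#", pos)) != std::string::npos) {
        std::size_t end = text.find(';', pos + 2);
        if (end == std::string::npos) break;  // unterminated reference
        std::string digits = text.substr(pos + 2, end - pos - 2);
        // Accept only non-empty, purely decimal bodies.
        if (!digits.empty() &&
            digits.find_first_not_of("0123456789") == std::string::npos) {
            values.push_back(std::strtoul(digits.c_str(), nullptr, 10));
        }
        pos = end + 1;
    }
    return values;
}

// Hypothetical helper: merges high/low surrogate pairs into code points.
std::vector<char32_t> to_code_points(const std::vector<unsigned long> &units) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        unsigned long u = units[i];
        bool high = u >= 0xD800 && u <= 0xDBFF;
        bool low_follows = i + 1 < units.size() &&
                           units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (high && low_follows) {
            out.push_back(static_cast<char32_t>(
                0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)));
            ++i;  // the low surrogate has been consumed as well
        } else {
            out.push_back(static_cast<char32_t>(u));  // passed through as-is
        }
    }
    return out;
}

// Hypothetical helper: formats a code point in "U+XXXX" notation.
std::string format_code_point(char32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(cp));
    return buf;
}
```

For the string from the question, `to_code_points(parse_decimal_refs("&#55357;&#56842;"))` yields a single code point, and `format_code_point` turns it into "U+1F60A".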

`codecvt` is only required to be present in C++11 or later. If you use Clang or GCC, make sure you pass the `-std=c++11` flag. If you really cannot use it, you will have to use the *by hand* solution, because the standard library way is `codecvt`. — Serge Ballesta, Feb 23 '16 at 08:23
Thanks. By hand I converted the surrogates to the decimal code of the emoji, but now the new problem is how to convert the decimal code of the emoji to Unicode :) — Byung-Jai Im, Feb 24 '16 at 01:14
