
I receive string data from a parameter, such as &#55357;&#56842;.

These are Unicode's UTF-16 surrogate pairs represented as decimal.

How can I convert them to Unicode code points such as "U+1F62C" with the standard library?

Quentin Pradet

1 Answer


You can easily do it by hand. The algorithm for going from a code point above U+FFFF to its surrogate pair and back is not that hard. The Wikipedia page on UTF-16 says:

U+10000 to U+10FFFF

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
  • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
  • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
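The three steps quoted above can be sketched as follows. This is only a minimal illustration: `to_surrogates` is a hypothetical helper name, not part of the standard library, and it assumes the caller has already checked that the code point lies in U+10000..U+10FFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point in U+10000..U+10FFFF into a
// UTF-16 surrogate pair {high, low}. No range check is done here.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    std::uint32_t v = cp - 0x10000;  // 20-bit value, 0..0xFFFFF
    // Top ten bits go into the high surrogate, low ten bits into the low one.
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};
}
```

For U+1F60A this should give 0xD83D and 0xDE0A, i.e. 55357 and 56842 in decimal.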

That's just bitwise and, or and shift and can trivially be implemented in C or C++.
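For the direction the question asks about (surrogate pair to code point), a by-hand version might look like the sketch below. The names `from_surrogates` and `to_u_notation` are made up for this example, and returning 0 for an invalid pair is just an assumption to keep it short; real code would probably report the error differently.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: combine a high/low surrogate pair into one code point.
// Returns 0 as a sentinel if the inputs are not a valid pair.
std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low) {
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return 0;
    // Undo the offsets, reassemble the 20-bit value, then add 0x10000 back.
    return 0x10000u + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
                    + (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// Hypothetical helper: format a code point in the usual "U+XXXX" notation.
std::string to_u_notation(std::uint32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(cp));
    return buf;
}
```

With the values from the question, `to_u_notation(from_surrogates(55357, 56842))` should produce the string `U+1F60A`.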


Since you said you want to use the standard library: what you are asking for is a conversion from two 16-bit UTF-16 surrogates to one 32-bit Unicode code point, so codecvt is your friend, provided you can compile in C++11 mode or later. Note that the <codecvt> header has been deprecated since C++17.

Here is an example processing your values on a little-endian architecture:

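#include <cstdint>
#include <cwchar>
#include <iostream>
#include <locale>
#include <codecvt>

int main() {
    // Decodes UTF-16 stored as little-endian bytes into UTF-32 code points.
    std::codecvt_utf16<char32_t, 0x10ffffUL,
        std::codecvt_mode::little_endian> cvt;
    std::mbstate_t state{};  // must be zero-initialized before the first call

    char16_t pair[] = { 55357, 56842 };  // 0xD83D, 0xDE0A
    // in() works on raw bytes, hence the dependence on the host byte order.
    const char *first = reinterpret_cast<const char *>(pair);
    const char *next;

    char32_t u[2];
    char32_t *unext;

    // Convert the four input bytes into at most one code point in u[0].
    cvt.in(state, first, first + sizeof pair, next, u, u + 1, unext);

    std::cout << std::hex << static_cast<std::uint16_t>(pair[0]) << " "
        << static_cast<std::uint16_t>(pair[1]) << std::endl;
    std::cout << std::hex << static_cast<std::uint32_t>(u[0]) << std::endl;

    return 0;
}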

Output is as expected:

d83d de0a
1f60a
Serge Ballesta
  • `codecvt` is only required to be present for C++11 or later. If you use Clang or GCC, make sure you pass the `-std=c++11` flag. If you really cannot use it, you will have to use the *by hand* solution, because the standard library way is `codecvt`. – Serge Ballesta Feb 23 '16 at 08:23
  • Thanks. By hand I converted the surrogates to the emoji's decimal code, but now the new problem is how to convert the emoji's decimal code to Unicode :) – Byung-Jai Im Feb 24 '16 at 01:14