I got some string data from a parameter, such as the pair 55357 and 56876.
These are UTF-16 surrogate pairs represented as decimal values.
How can I convert them to a Unicode code point such as "U+1F62C" with the standard library?
You can easily do it by hand. The algorithm for going from a code point above U+FFFF to a surrogate pair and back is not that hard. The Wikipedia page on UTF-16 describes it roughly as follows: subtract 0x10000 from the code point, put the high ten bits of the 20-bit result into a first code unit in the range 0xD800-0xDBFF (the high surrogate), and the low ten bits into a second code unit in the range 0xDC00-0xDFFF (the low surrogate); decoding reverses those steps.
That's just bitwise AND, OR and shift operations, and it can trivially be implemented in C or C++.
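For example, here is a minimal sketch of the manual decoding (the function name combine_surrogates is just an illustrative choice, and the two inputs are assumed to already be a valid high/low surrogate pair):

#include <cstdint>
#include <cstdio>

// Combine a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF)
// into the code point they encode; validity of the inputs is assumed.
std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         +  (static_cast<std::uint32_t>(lo) - 0xDC00u);
}

int main() {
    // 55357 = 0xD83D, 56842 = 0xDE0A -> prints U+1F60A
    std::printf("U+%X\n", static_cast<unsigned>(combine_surrogates(55357, 56842)));
    return 0;
}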
As you said you want to use the standard library: what you ask for is a conversion from two 16-bit UTF-16 code units to one 32-bit Unicode code point, so std::codecvt_utf16
is your friend, provided you can compile in C++11 mode or higher.
Here is an example processing your values on a little-endian architecture:
#include <iostream>
#include <locale>
#include <codecvt>
#include <cstdint>
#include <cwchar>

int main() {
    // Facet converting between UTF-16 byte sequences (little-endian here)
    // and UTF-32/UCS-4 code points.
    std::codecvt_utf16<char32_t, 0x10ffffUL,
                       std::codecvt_mode::little_endian> cvt;
    std::mbstate_t state{};
    char16_t pair[] = { 55357, 56842 };   // high and low surrogate in decimal
    const char *next;
    char32_t u[2];
    char32_t *unext;
    // The facet works on bytes, so the surrogate pair is passed as a
    // 4-byte little-endian UTF-16 sequence.
    cvt.in(state, (const char *) pair, (const char *) (pair + 2), next,
           u, u + 1, unext);
    std::cout << std::hex << (uint16_t) pair[0] << " " << (uint16_t) pair[1]
              << std::endl;
    std::cout << std::hex << (uint32_t) u[0] << std::endl;
    return 0;
}
Output is as expected:
d83d de0a
1f60a
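If you also want the textual "U+..." notation from the question, it is just a matter of printing the resulting code point as uppercase hex, for example with the stream manipulators already available through <iostream>:

    std::cout << "U+" << std::uppercase << std::hex << (uint32_t) u[0] << std::endl;  // prints U+1F60A here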