How can I get the code point from a Unicode value? According to the character code table, the code point for the pictogram '丂' is 8140, and its Unicode value is \u4E02.

I wrote this C++ program to try to get the code point for a Unicode string value:

#include <iostream>
#include <atlstr.h>
#include <iomanip>
#include <codecvt>

void hex_print(const std::string& s);

int main()
{
    std::wstring test = L"丂"; //assign pictogram directly
    std::wstring test2 = L"\u4E02"; //assign value via Unicode

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string u8str = conv1.to_bytes(test);
    hex_print(u8str);

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv2;
    std::string u8str2 = conv2.to_bytes(test2);
    hex_print(u8str2);

    return 0;

}

void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for (unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

Output:

00 81 00 40
4e 02

What can I do to get 00 81 00 40, when the value is \u4E02?

Ferrus
  • Is your locale set to the gb18030 character set? Some experimentation with the invaluable `iconv` on Linux concluded that gb18030 `8140` is utf-16 `4e02`. The 丂 in the shown code would map to octets `0x81 0x40` only if both the editor and the C++ compiler use gb18030. I suspect that you're living in the modern Unicode world, where UTF-8 rules the roost, so you have to do this conversion backwards. 丂 is really `0xe4 0xb8 0x82` in UTF-8. – Sam Varshavchik Feb 02 '23 at 02:14
  • If you want to convert to `utf-8`, shouldn't you be using `std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t>` or `std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>` (depending on your architecture for `wchar_t`)? – Galik Feb 02 '23 at 02:31
  • I saved the source file with GB18030 encoding (right-click, Open With..., C++ Source Code Editor (with encoding), then selected Chinese Simplified (GB18030) - Codepage 54936). You mentioned the compiler; how can I change its encoding? And yes, I am from America, so GB18030 is not common here =/ – Ferrus Feb 02 '23 at 02:33
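
To make the byte values quoted in the comments concrete, here is a minimal sketch that encodes a single code point by hand. The helper names (`encode_utf8`, `encode_utf16be`, `to_hex`) are made up for illustration and are not part of any library, and the encoders assume the input is a valid Unicode scalar value (no surrogate checks).

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (no validation is done).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as big-endian UTF-16 bytes,
// using a surrogate pair above U+FFFF.
std::string encode_utf16be(char32_t cp)
{
    std::string out;
    auto put16 = [&out](unsigned unit) {
        out += static_cast<char>((unit >> 8) & 0xFF);
        out += static_cast<char>(unit & 0xFF);
    };
    if (cp < 0x10000) {
        put16(cp);
    } else {
        cp -= 0x10000;
        put16(0xD800 | (cp >> 10));
        put16(0xDC00 | (cp & 0x3FF));
    }
    return out;
}

// Hypothetical helper: format bytes as space-separated lowercase hex.
std::string to_hex(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}
```

For U+4E02, `to_hex(encode_utf8(0x4E02))` gives `e4 b8 82` and `to_hex(encode_utf16be(0x4E02))` gives `4e 02`. Neither is the GB18030 sequence `81 40`, which comes from a separate mapping table rather than from arithmetic on the code point.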

1 Answer

On Windows, you can use `WideCharToMultiByte` with code page 54936 (GB18030):

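#include <windows.h>
#include <codecvt>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

// Print each byte of s as two hex digits.
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for (unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    std::wstring test = L"丂"; // assign pictogram directly
    std::wstring test2 = L"\u4E02"; // assign value via Unicode escape

    // codecvt_utf16 produces big-endian UTF-16 bytes, not GB18030
    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string utf16str = conv1.to_bytes(test);
    hex_print(utf16str);

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv2;
    std::string utf16str2 = conv2.to_bytes(test2);
    hex_print(utf16str2);

    // First call: query the required size in bytes (includes the terminating null,
    // because the input length is passed as -1)
    int len = WideCharToMultiByte(54936, 0, test2.c_str(), -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return 1; // conversion failed
    char* strGB18030 = new char[len + 1];
    // Second call: convert UTF-16 to GB18030 (code page 54936)
    WideCharToMultiByte(54936, 0, test2.c_str(), -1, strGB18030, len, NULL, NULL);
    hex_print(std::string(strGB18030));
    delete[] strGB18030;

    return 0;
}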

Output:

4e 02
4e 02
81 40
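
Outside Windows, the same conversion can probably be done with POSIX `iconv`, which one of the comments on the question also used. This is only a sketch under a few assumptions: the function name `utf16be_to_gb18030` is made up for illustration, the platform's iconv must recognize the encoding names `GB18030` and `UTF-16BE` (glibc does), and the input is taken as big-endian UTF-16 bytes like the ones `hex_print` shows above.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert big-endian UTF-16 bytes to GB18030 bytes.
// Assumes the platform's iconv supports both encoding names.
std::string utf16be_to_gb18030(const std::string& utf16be)
{
    iconv_t cd = iconv_open("GB18030", "UTF-16BE"); // (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: encoding not supported");

    // GB18030 uses at most 4 bytes per code point, so twice the
    // UTF-16 byte count (plus a little slack) is enough room.
    std::string out(utf16be.size() * 2 + 4, '\0');

    char* in_ptr = const_cast<char*>(utf16be.data());
    size_t in_left = utf16be.size();
    char* out_ptr = &out[0];
    size_t out_left = out.size();

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed: invalid or incomplete input");

    out.resize(out.size() - out_left); // keep only the bytes actually written
    return out;
}
```

Passing the two bytes `4e 02` (for example `std::string("\x4e\x02", 2)`) should give back the two bytes `81 40`, matching the last line of the output above.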
lex