
I have third-party code which Punycode-encodes and -decodes strings. For Unicode input/output it uses 32-bit Unicode strings (`uint32_t`-based), not 16-bit. My own input/output is `BSTR` (16-bit UTF-16). How should I convert between a 32-bit Unicode char array and a `BSTR` (in both directions)?

The code should work in Visual C++ 6.0 and later versions.

Alex
  • You'll need a 3rd party library; the [`<codecvt>` header](http://en.cppreference.com/w/cpp/locale/codecvt_utf16) isn't available until VS2010 – Mgetz Aug 31 '17 at 13:51
  • You can't ask for a solution for *obsolete* versions of the language. VC 6 is 20 years old; C++ didn't even support Unicode back then. The current *language* version is C++17. The language got Unicode support and literals in C++11 with `char16_t` and `char32_t`, along with `u16string` and `u32string` in the standard library. No *standard-compliant* solution is going to work with a 20-year-old compiler. Check [String and Character Literals](https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp) for reference – Panagiotis Kanavos Aug 31 '17 at 13:59
  • @Mgetz it's not just a matter of when `codecvt` was added. After all, `BSTR` is a UTF-16 string with a length. The language itself didn't have Unicode support 20 years ago; it has now. `uint32_t` is the *wrong* type to use now. And I really don't think that VC++ 6 was able to work reliably with UTF-32 20 years ago. Character set conversions are delegated to the OS, and I suspect VC++ 6 didn't make the appropriate calls for UTF-32 back then simply because they didn't exist – Panagiotis Kanavos Aug 31 '17 at 14:04
  • Thank you for the replies. I just wasn't able to find any Punycode implementation which does not use `uint32_t`. I'd be happy to use an implementation that is 16-bit based (or implemented more correctly, not using `uint32_t`) if it existed. – Alex Aug 31 '17 at 14:14
  • @PanagiotisKanavos C++ doesn't know what Unicode is; other than that, it's all bytes in a `std::string` or `std::wstring` – Mgetz Aug 31 '17 at 14:14
  • @Mgetz, no, it got Unicode strings, literals etc. in C++11. It *does* have `char16_t`, `char32_t`, `u16string` and `u32string` now. Check [String and Character Literals](https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp). What it *still* doesn't have is UTF-8 support: `char`, `string` and good will are used for UTF-8, hoping that the *other* developer doesn't use the wrong codepage to read a localized file – Panagiotis Kanavos Aug 31 '17 at 14:18
  • @Alex: Your best option is to use a 3rd party Unicode library, like [libiconv](https://www.gnu.org/software/libiconv/) or [ICU](http://site.icu-project.org/). But converting between UTF-16 and UTF-32 is trivial to implement manually, even in old C++ versions. – Remy Lebeau Aug 31 '17 at 17:38
  • @RemyLebeau Will try that, thanks! – Alex Aug 31 '17 at 18:08

1 Answer

UTF-16 is identical to UTF-32 for code points below 0x10000; anything above that must be encoded as a surrogate pair. You can use the following conversion to display UTF-32 code points in Windows.

Note: this is based on the Wikipedia UTF-16 article. I didn't add any error checks; it expects valid code points.

void get_utf16(std::wstring &str, int ch32)
{
    const int mask = (1 << 10) - 1;
    if(ch32 < 0x10000)
    {
        //BMP code point: identical in UTF-16 and UTF-32
        str.push_back((wchar_t)ch32);
    }
    else
    {
        //supplementary code point: subtract 0x10000, then split the
        //remaining 20 bits into a high and a low surrogate
        ch32 -= 0x10000;
        int hi = (ch32 >> 10) & mask;
        int lo = ch32 & mask;

        hi += 0xD800;
        lo += 0xDC00;

        str.push_back((wchar_t)hi);
        str.push_back((wchar_t)lo);
    }
}

For example, the following code should display a smiley face (U+1F600) in Windows 10:

std::wstring str;
get_utf16(str, 0x1f600);
::MessageBoxW(0, str.c_str(), 0, 0);


Edit:

Obtaining UTF-16 from an array of UTF-32 code points, and the reverse operation:

A UTF-16 code point occupies either one wchar_t unit (2 bytes) or two wchar_t units joined as a surrogate pair (4 bytes). A unit in the range 0xD800 to 0xDFFF indicates a surrogate pair, i.e. the code point continues in the next unit.

bool get_str_utf16(std::wstring &dst, const std::vector<unsigned int> &src)
{
    const int mask = (1 << 10) - 1;
    for(size_t i = 0; i < src.size(); i++)
    {
        unsigned int ch32 = src[i];
        ////check for invalid range
        //if(ch32 > 0x10FFFF || (ch32 >= 0xD800 && ch32 < 0xE000))
        //{
        //  cout << "invalid code point\n";
        //  return false;
        //}

        if(ch32 >= 0x10000)
        {
            //supplementary code point: encode as a surrogate pair
            ch32 -= 0x10000;
            int hi = (ch32 >> 10) & mask;
            int lo = ch32 & mask;
            hi += 0xD800;
            lo += 0xDC00;
            dst.push_back((wchar_t)hi);
            dst.push_back((wchar_t)lo);
        }
        else
        {
            //BMP code point: a single UTF-16 unit
            dst.push_back((wchar_t)ch32);
        }
    }
    return true;
}

void get_str_utf32(std::vector<unsigned int> &dst, const std::wstring &src)
{
    for(size_t i = 0; i < src.size(); i++)
    {
        const wchar_t ch = src[i];
        if(ch >= 0xD800 && ch < 0xDC00 && i + 1 < src.size())
        {
            //lead surrogate: this unit is joined with the next one
            unsigned int hi = ch;
            unsigned int lo = src[++i];
            hi -= 0xD800;
            lo -= 0xDC00;
            unsigned int u32 = 0x10000 + (hi << 10) + lo;
            dst.push_back(u32);
        }
        else
        {
            //BMP unit: maps directly to the same UTF-32 code point
            dst.push_back(ch);
        }
    }
}

Example:

std::wstring u16 = L"123456";

std::vector<unsigned int> u32;
get_str_utf32(u32, u16);
cout << "\n";

cout << "UTF-32 result: ";
for(auto e : u32)
    printf("0x%X ", e);
cout << "\n";

std::wstring test;
get_str_utf16(test, u32);
MessageBox(0, test.c_str(), (u16 == test) ? L"OK" : L"ERROR", 0);
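
Since the question's own input/output is `BSTR`, the two functions above already cover both directions: a `BSTR` is a length-prefixed UTF-16 string, so it can be read into a `std::wstring` with `SysStringLen` and produced with `SysAllocStringLen`. Here is a minimal sketch of such glue code (the wrapper names `bstr_to_utf32` and `utf32_to_bstr` are mine, not part of the answer's code):

#include <windows.h>
#include <oleauto.h>
#include <string>
#include <vector>

//BSTR -> UTF-32 code points; SysStringLen returns the length in
//wchar_t units, so embedded nulls are preserved
void bstr_to_utf32(std::vector<unsigned int> &dst, BSTR src)
{
    if(!src) return; //a null BSTR counts as an empty string
    std::wstring u16(src, SysStringLen(src));
    get_str_utf32(dst, u16);
}

//UTF-32 code points -> BSTR; the caller owns the result and must
//release it with SysFreeString
BSTR utf32_to_bstr(const std::vector<unsigned int> &src)
{
    std::wstring u16;
    get_str_utf16(u16, src);
    return SysAllocStringLen(u16.c_str(), (UINT)u16.size());
}

These OLE Automation functions long predate Visual C++ 6.0, so the compiler constraint in the question is not a problem here.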
Barmak Shemirani
  • Thanks a lot! Now I will need to figure out how to do 16-bit BSTR to 32-bit (reverse conversion). Hopefully with the links you provided it should be fairly easy to implement the algorithm. – Alex Sep 01 '17 at 10:31
  • Great. You are welcome. Make sure to run some torture tests. I have used the first part of the above code in my apps, and it seems to be stable. – Barmak Shemirani Sep 03 '17 at 21:18