Convert codepoint to wchar_t in C

Question

I know the Unicode code points of these two Chinese characters, 你好, and I have them in str.

How can I convert the code points in char * str to Chinese characters and assign them to wchar_t * wstr?

char * str = "4F60 597D";
wchar_t * wstr;

I know that I can assign it directly like this, and the problem is solved:

wchar_t * wstr = L"\u4F60\u597D";

But my problem is more complicated than that; my situation does not allow it.

How can I do the conversion from a textual code point to wchar_t *?

Thanks.

I am using MS Visual C with the character set set to MBCS; assume that I cannot use the Unicode character set.

UPDATE : Sorry, just corrected the wchar_t wstr to wchar_t * wstr

UPDATE: The char * str now contains a sequence of UTF-8 code units for the two Chinese characters 你好:

char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";    
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize  = 0;
_locale_t local = _create_locale( LC_ALL , "Chinese");
_mbstowcs_s_l(&convertedSize, wstr, len, str, _TRUNCATE, local);
MessageBoxW( NULL, wstr , (LPCWSTR)L"Hello", MB_OK);

Why is the MessageBox showing Japanese characters instead of Chinese? What is the right locale name to use?

This question is rather confusing. What exactly do you have right now? A `char*` with lots of hex codes that represent Unicode code points? And in what encoding should the MBCS be? — RedX, Apr 12 '13 at 07:15
@RedX : Yes, I have a char* that holds lots of Unicode code points separated by spaces. I need to convert it to wchar_t *. Sorry, I don't understand "in what encoding should the MBCS be". — William, Apr 12 '13 at 07:20
More to the point, you have lots of hex chars in a space-separated string that *represent* unicode code points. I honestly don't see a simple "call this" solution to your problem. Each quad needs to be converted to an unsigned 16-bit value, then translated to a `wchar_t`. Were this in C++ a fairly elegant solution would be doable using standard algorithms and a pair of containers, but in C, you may have to get your hands dirty. — WhozCraig, Apr 12 '13 at 07:26
@WhozCraig : Hi, is there maybe a library that can help with this? By the way, I don't mind getting my hands dirty; so far I think this is the way to solve my problem. Would you be able to show me some basic code for this? Or maybe point out the function to convert a code point to a 16-bit value? — William, Apr 12 '13 at 07:31
Yeah I just perused the wchar.h library and the multi-byte and wide-char facilities in the standard library. Nothing pops up as a simple solution to the conversion you seek. I perish the thought of a `strtok_r()` loop to perform the conversion, but it may be all you have. It is a somewhat unusual representation of Unicode codepoints that you're converting. — WhozCraig, Apr 12 '13 at 07:53
@WhozCraig : Hi, I have updated my question. I need to know whether I am using the mbstowcs_s function correctly. Thanks. The content of char * str can be anything I want; the only restriction is that the data type has to be char * str. — William, Apr 12 '13 at 08:03
It looks correct. The only thing I see as a potential problem is not setting the locale prior to the call, or not using the `_mbstowcs_s_l()` function, which lets you specify your locale directly in the conversion call. See the [Locale-related CRT functions](http://msdn.microsoft.com/en-us/library/wyzd2bce.aspx) and their notes for more information. Honestly though, if you're using Windows and *know* you will be, *and* you can control the input (i.e. you can say it will always be a UTF8-encoded string), I would use `MultiByteToWideChar(CP_UTF8,...)`, but that's pure winapi. — WhozCraig, Apr 12 '13 at 08:14
@WhozCraig : Hi, I have updated the code to include the locale. Now it's a little better but still incorrect: the MessageBox shows Japanese characters instead of Chinese. What might be the right locale to use? Thanks — William, Apr 12 '13 at 08:47
You are still trying to solve [this problem](http://stackoverflow.com/questions/15962259/chinese-character-in-source-code-when-utf-8-settings-cant-be-used), right? In that case, I don't think it's a good idea to arrange the code points (or hex numbers) in a string, separated by space. That only complicates the matter. — jogojapan, Apr 12 '13 at 08:49
On a separate note, you should also look at the questions displayed on the right as "related". Some of this may provide the solution you are looking for. — jogojapan, Apr 12 '13 at 08:50
@jogojapan : Hi, yes, I have a new update for this question. I think it's getting close; it's just that the MessageBox does not display correctly. — William, Apr 12 '13 at 09:01
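
On the Japanese output in the updated question: the bytes in str are UTF-8, but the "Chinese" locale most likely selects a legacy Chinese code page (GBK, code page 936) rather than UTF-8. Read as GBK, the six bytes come out as 浣犲ソ, and the last of those is a katakana character. So the fix is probably not a different locale name but a UTF-8 decoder. Below is a minimal portable sketch of such a decoder; `utf8_to_wcs` is a hypothetical helper name, not a CRT or WinAPI function, and it assumes wchar_t is either 16 bits (UTF-16, as on Windows) or wide enough to hold a full code point.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decodes a NUL-terminated UTF-8 string into a newly
   allocated wide string. Returns NULL on malformed input or allocation
   failure. The caller must free() the result. */
wchar_t *utf8_to_wcs(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    /* An n-byte UTF-8 sequence never produces more than n wide units,
       so strlen(s) + 1 elements are always enough. */
    wchar_t *out = malloc((strlen(s) + 1) * sizeof *out);
    size_t n = 0;

    if (out == NULL)
        return NULL;

    while (*p != 0) {
        unsigned long cp, min;
        int extra;                      /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else { free(out); return NULL; }    /* invalid lead byte */
        p++;

        while (extra-- > 0) {
            /* Also catches a string that ends in the middle of a sequence. */
            if ((*p & 0xC0) != 0x80) { free(out); return NULL; }
            cp = (cp << 6) | (*p++ & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            free(out);
            return NULL;
        }

        if (cp > 0xFFFF && sizeof(wchar_t) == 2) {
            /* 16-bit wchar_t: encode as a UTF-16 surrogate pair. */
            cp -= 0x10000;
            out[n++] = (wchar_t)(0xD800 + (cp >> 10));
            out[n++] = (wchar_t)(0xDC00 + (cp & 0x3FF));
        } else {
            out[n++] = (wchar_t)cp;
        }
    }
    out[n] = L'\0';
    return out;
}
```

With the string from the question, `utf8_to_wcs("\xE4\xBD\xA0\xE5\xA5\xBD")` should give the two code units 0x4F60 and 0x597D, which can then be passed to MessageBoxW. On Windows, the `MultiByteToWideChar(CP_UTF8, ...)` call suggested in the comments does the same job without hand-written decoding.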

score 0 · Answer 1 · answered Apr 12 '13 at 08:17

I can think of this function:

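#include <cstring>   // strlen

// Value of one hex digit. Assumes upper-case hex digits.
#define GetValFromHex(x) ((x) > '9' ? (x) - 'A' + 10 : (x) - '0')

// Each character is 8 chars of text, e.g. \x4F\x60 (backslash, x, two hex
// digits, twice). Assumes the first pair is the high byte of the code unit.
wchar_t GetChineesChar(const char* strInput)
{
    unsigned hi = GetValFromHex(strInput[2]) * 16 + GetValFromHex(strInput[3]);
    unsigned lo = GetValFromHex(strInput[6]) * 16 + GetValFromHex(strInput[7]);

    // Combine by shifting so the result does not depend on byte order.
    return (wchar_t)((hi << 8) | lo);
}

// The caller must delete[] the returned string.
wchar_t* GetChineesString(const char* strInput)
{
    size_t  len = strlen(strInput) / 8;
    wchar_t* returnVal = new wchar_t[len + 1];   // +1 for the terminator
    for (size_t i = 0; i < len; i++)
    {
         returnVal[i] = GetChineesChar(&strInput[i*8]);
    }
    returnVal[len] = L'\0';
    return returnVal;
}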

Then you should just call GetChineesString(). Of course, you can add more validation to check that the first two chars are \x and that the fifth and sixth chars are \x too before moving forward. This is a starting point for more robust code; it is not robust and has not been tested.

Edit: I am assuming all hex values are Upper Case.
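
For the format in the original question (hex code points separated by spaces, such as "4F60 597D"), the per-group conversion discussed in the comments could be sketched with `strtoul`, which also accepts lower-case hex. This is only a sketch under assumptions: `codepoints_to_wcs` is a hypothetical helper name, the only separator handled is a space, and wchar_t is assumed to be either 16 bits (UTF-16) or wide enough for a full code point.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: parses space-separated hex code points into a newly
   allocated, NUL-terminated wide string. Returns NULL on malformed input or
   allocation failure. The caller must free() the result. */
wchar_t *codepoints_to_wcs(const char *str)
{
    /* A group of n characters yields at most n wide units (two units are
       only needed for groups of five or more hex digits), so
       strlen(str) + 1 elements are always enough. */
    wchar_t *out = malloc((strlen(str) + 1) * sizeof *out);
    const char *p = str;
    size_t n = 0;

    if (out == NULL)
        return NULL;

    for (;;) {
        char *end;
        unsigned long cp;

        while (*p == ' ')               /* skip separators */
            p++;
        if (*p == '\0')
            break;

        /* strtoul would accept a leading sign; require a hex digit. */
        if (!isxdigit((unsigned char)*p)) { free(out); return NULL; }

        cp = strtoul(p, &end, 16);
        /* The group must end at a space or at the end of the string. */
        if (*end != ' ' && *end != '\0') { free(out); return NULL; }
        /* Reject surrogates and anything beyond U+10FFFF (this also
           covers the overflow value that strtoul returns). */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            free(out);
            return NULL;
        }

        if (cp > 0xFFFF && sizeof(wchar_t) == 2) {
            /* 16-bit wchar_t: encode as a UTF-16 surrogate pair. */
            cp -= 0x10000;
            out[n++] = (wchar_t)(0xD800 + (cp >> 10));
            out[n++] = (wchar_t)(0xDC00 + (cp & 0x3FF));
        } else {
            out[n++] = (wchar_t)cp;
        }
        p = end;
    }
    out[n] = L'\0';
    return out;
}
```

A call such as `wchar_t *wstr = codepoints_to_wcs("4F60 597D");` would then give a string that can be handed to MessageBoxW, followed by `free(wstr)` once it is no longer needed.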
