
Suppose I know the Unicode code points of the two Chinese characters 你好, stored as text in str.

How can I convert the code points in this char * str to Chinese characters and assign them to a wchar_t * wstr?

char * str = "4F60 597D";
wchar_t * wstr;

I know that I can assign the literal directly like this, which solves the problem:

wchar_t * wstr = L"\u4F60\u597D";

But my problem is more complicated than that; my situation does not allow it.

How can I do the conversion from literal code points to wchar_t *?

Thanks.

I am using MS Visual C with the character set set to MBCS; assume that I cannot use the UNICODE character set.

UPDATE: Sorry, I just corrected wchar_t wstr to wchar_t * wstr.

UPDATE: The char * str now contains a sequence of UTF-8 code units for the two Chinese characters 你好:

char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";    
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize  = 0;
_locale_t local = _create_locale( LC_ALL , "Chinese");
_mbstowcs_s_l(&convertedSize, wstr, len, str, _TRUNCATE, local);
MessageBoxW( NULL, wstr , (LPCWSTR)L"Hello", MB_OK);

Why does the MessageBox show Japanese characters instead of Chinese ones? What is the right locale name to use?

William
  • This question is rather confusing. What exactly do you have right now? A `char*` with lots of hex codes that represent Unicode code points? And in what encoding should the MBCS be? – RedX Apr 12 '13 at 07:15
  • @WhozCraig : Yes, sorry, I just corrected that. – William Apr 12 '13 at 07:19
  • @RedX : Yes, I have a char* that holds lots of Unicode code points separated by spaces. I need to convert it to wchar_t *. Sorry, I don't understand "in what encoding should the MBCS be". – William Apr 12 '13 at 07:20
  • More to the point, you have lots of hex chars in a space-separated string that *represent* unicode code points. I honestly don't see a simple "call this" solution to your problem. Each quad needs to be converted to an unsigned 16-bit value, then translated to a `wchar_t`. Were this in C++ a fairly elegant solution would be doable using standard algorithms and a pair of containers, but in C, you may have to get your hands dirty. – WhozCraig Apr 12 '13 at 07:26
  • @WhozCraig : Hi, is there maybe a library that can help with this? By the way, I don't mind getting my hands dirty; so far I think this is the way to solve my problem. Would you be able to show me some basic code for this, or point out the function that converts a code point to a 16-bit value? – William Apr 12 '13 at 07:31
  • Yeah I just perused the wchar.h library and the multi-byte and wide-char facilities in the standard library. Nothing pops up as a simple solution to the conversion you seek. I perish the thought of a `strtok_r()` loop to perform the conversion, but it may be all you have. It is a somewhat unusual representation of Unicode codepoints that you're converting. – WhozCraig Apr 12 '13 at 07:53
  • @WhozCraig : Hi, I have updated my question. I need to know whether I am using the mbstowcs_s function correctly. Thanks. The content of char * str can be anything I want; the only constraint is that the data type has to be char * str. – William Apr 12 '13 at 08:03
  • It looks correct; the only thing I see as a potential problem is not setting the locale prior to the call, or not using the `_mbstowcs_s_l()` function that allows you to specify your locale together with the conversion call. See the [Locale-related CRT functions](http://msdn.microsoft.com/en-us/library/wyzd2bce.aspx) and their notes for more information. Honestly though, if you're using Windows and *know* you will be, *and* you can control the input (i.e. you can say it will always be a UTF-8 encoded string), I would use `MultiByteToWideChar(CP_UTF8,...)`, but that's pure WinAPI. – WhozCraig Apr 12 '13 at 08:14
  • @WhozCraig : Hi, I have updated the code to include the locale. Now it's a little better but still incorrect: the MessageBox shows Japanese characters instead of Chinese. What might be the right locale to use? Thanks – William Apr 12 '13 at 08:47
  • You are still trying to solve [this problem](http://stackoverflow.com/questions/15962259/chinese-character-in-source-code-when-utf-8-settings-cant-be-used), right? In that case, I don't think it's a good idea to arrange the code points (or hex numbers) in a string, separated by space. That only complicates the matter. – jogojapan Apr 12 '13 at 08:49
  • On a separate note, you should also look at the questions displayed on the right as "related". Some of these may provide the solution you are looking for. – jogojapan Apr 12 '13 at 08:50
  • @jogojapan : Hi, yes, I have a new update for this question. I think it's getting close; it's just that the MessageBox does not display correctly. – William Apr 12 '13 at 09:01
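
The suggestion in the comments above, to treat the input as UTF-8 and convert it explicitly instead of going through a locale, can be illustrated with a small hand-written decoder. A locale name such as "Chinese" most likely selects a legacy Chinese code page rather than UTF-8, which would explain the wrong characters; that reading is an assumption and is not confirmed anywhere in the thread. On Windows, `MultiByteToWideChar(CP_UTF8, ...)` as mentioned by WhozCraig would normally do this job. The sketch below only shows what such a conversion does; `Utf8ToWideSketch` is a hypothetical name, and it covers only 1- to 3-byte sequences (the Basic Multilingual Plane).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): decodes UTF-8 into wchar_t values.
// Assumption: only 1- to 3-byte sequences (code points up to U+FFFF) are needed.
// Any other lead byte, or a broken continuation byte, yields U+FFFD.
std::wstring Utf8ToWideSketch(const char* utf8)
{
    std::wstring out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8);
    while (*p != 0)
    {
        unsigned long cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else { out += L'\xFFFD'; ++p; continue; }   // unsupported lead byte
        ++p;

        bool ok = true;
        for (std::size_t i = 0; i < extra; ++i)
        {
            // A continuation byte must look like 10xxxxxx.
            if ((*p & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (*p & 0x3F);
            ++p;
        }
        out += ok ? static_cast<wchar_t>(cp) : L'\xFFFD';
    }
    return out;
}
```

Called as `Utf8ToWideSketch("\xE4\xBD\xA0\xE5\xA5\xBD")`, it returns a `std::wstring` whose `c_str()` could then be passed to `MessageBoxW`.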

1 Answer


I can think of this function:

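#include <cstring>

// Converts one upper-case hex digit ('0'-'9', 'A'-'F') to its numeric value.
#define GetValFromHex(x) ((x) > '9' ? (x) - 'A' + 10 : (x) - '0')

// Reads one character written as the 8 chars \xHH\xHH (high byte first).
wchar_t GetChineesChar(const char* strInput)
{
    unsigned int hi = GetValFromHex(strInput[2]) * 16 + GetValFromHex(strInput[3]);
    unsigned int lo = GetValFromHex(strInput[6]) * 16 + GetValFromHex(strInput[7]);

    // Combine arithmetically so the result does not depend on byte order.
    return (wchar_t)((hi << 8) | lo);
}

// The caller must delete[] the returned buffer.
wchar_t* GetChineesString(const char* strInput)
{
    size_t len = strlen(strInput) / 8;
    wchar_t* returnVal = new wchar_t[len + 1];   // +1 for the terminator
    for (size_t i = 0; i < len; i++)
    {
        returnVal[i] = GetChineesChar(&strInput[i * 8]);
    }
    returnVal[len] = L'\0';
    return returnVal;
}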

Then you should just call GetChineesString(). Of course, you can add more validation to check that the first two chars are \x and that the fifth and sixth chars are \x too before moving forward. This is a starting point for more robust code; it is not robust and has not been tested.

Edit: I am assuming all hex values are upper case.
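
For the space-separated form shown at the top of the question ("4F60 597D"), a sketch based on the standard `strtoul` function avoids the upper-case assumption, since `strtoul` with base 16 accepts both cases. `CodepointsToWideSketch` is a hypothetical name, and the code assumes every value fits in 16 bits; it is an illustration, not a tested replacement for the function above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the answer): parses whitespace-separated hex
// code points such as "4F60 597D" into a wide string.
// Assumption: every value is at most 0xFFFF. Parsing stops at the first token
// that is not a hex number or that exceeds that range.
std::wstring CodepointsToWideSketch(const char* text)
{
    std::wstring out;
    const char* p = text;
    for (;;)
    {
        char* end = 0;
        // strtoul skips leading whitespace and reads as many hex digits as it can.
        unsigned long cp = std::strtoul(p, &end, 16);
        if (end == p || cp > 0xFFFF)
            break;   // nothing was parsed, or the value is out of the assumed range
        out += static_cast<wchar_t>(cp);
        p = end;     // continue after the token just read
    }
    return out;
}
```

With the string from the question, `CodepointsToWideSketch("4F60 597D")` produces the two values 0x4F60 and 0x597D; the result's `c_str()` can be used wherever a `const wchar_t *` is expected.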

Mahmoud Fayez