C - How to convert wide char Japanese characters to UTF-8?

Question

I am trying to convert Japanese characters stored in a wide-char string to UTF-8, in order to store the value in a JSON file using the cJSON library. I first tried wcstombs_s, but apparently it does not support Japanese characters:

size_t len = wcslen(japanese[i].name) + 1;
char* japanese_char = malloc(len);
if (japanese_char == NULL) {
    exit(EXIT_FAILURE);
}
size_t sz;
wcstombs_s(&sz, japanese_char, len, japanese[i].name, _TRUNCATE);

Then, based on other answers and on a successful conversion in the other direction (JSON UTF-8 to wide char), I tried the opposite function as follows, but the destination buffer dest contains only garbage characters:

size_t wcsChars = wcslen(japanese[i].name);
size_t sizeRequired = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, NULL, 0, NULL, NULL);
char* dest = calloc(sizeRequired, 1);
WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, dest, sizeRequired, NULL, NULL);
free(dest);

The wide-char string I am trying to convert is ササササササササササササササササ, stored in japanese[i].name (a wchar_t* in a struct). The objective is to pass it to cJSON's cJSON_CreateString and save the value in a UTF-8 encoded JSON file.

Question: What is the proper way to convert Japanese from wchar_t to UTF-8 char in C (not C++)?

Remy Lebeau · Accepted Answer · 2019-10-04T23:35:36.307

Your wcstombs_s() code is passing the wrong value to the sizeInBytes parameter:

sizeInBytes

The size in bytes of the mbstr buffer.

You are passing in the character count of japanese[i].name, not the allocated byte count of japanese_char. They are not the same value.
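To make the gap between the two counts concrete, here is a minimal sketch of how many bytes a single code point occupies in each encoding. The function names utf8_bytes and utf16_bytes are illustrative only and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);      /* highest valid code point */
    if (cp < 0x80)    return 1;  /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  /* rest of the BMP, e.g. kana and kanji */
    return 4;
}

/* Bytes needed to encode one Unicode code point in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4; /* one code unit, or a surrogate pair */
}
```

For サ (U+30B5), utf8_bytes gives 3 while the wide string holds it in a single wchar_t, so a string of n such characters needs 3n + 1 bytes in UTF-8, not the n + 1 that wcslen() + 1 suggests.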

Unicode code points are encoded in UTF-16 (the encoding of wchar_t strings on Windows) using 2 or 4 bytes each, and in UTF-8 using 1 to 4 bytes each, depending on their value. Code points in the U+0800..U+FFFF range, which covers Japanese kana and kanji, take 3 bytes in UTF-8 but only 2 bytes (a single wchar_t) in UTF-16, so your japanese_char buffer needs more bytes than japanese[i].name has characters. Just as you can call WideCharToMultiByte() to determine the required destination buffer size, you can do the same thing with wcstombs_s():

size_t len = 0;
wcstombs_s(&len, NULL, 0, japanese[i].name, _TRUNCATE);
if (len == 0)
    exit(EXIT_FAILURE);
char* japanese_char = malloc(len);
if (!japanese_char)
    exit(EXIT_FAILURE);
wcstombs_s(&len, japanese_char, len, japanese[i].name, _TRUNCATE);
...
free(japanese_char);
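The same two-pass pattern (ask for the size, allocate, then convert) can be sketched with the standard wcstombs() function, whose size query with a NULL destination is documented by both POSIX and MSVC. The helper name wide_to_mb_dup is hypothetical. One caveat applies to this whole family of functions: they convert to the multibyte encoding of the current C locale, so the result is UTF-8 only if the locale is a UTF-8 one. MSVC may also emit a deprecation warning for wcstombs() in favor of wcstombs_s().

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a wide string to the current locale's
   multibyte encoding. Returns a malloc'd, null-terminated string that the
   caller must free(), or NULL on failure. */
char *wide_to_mb_dup(const wchar_t *src)
{
    assert(src != NULL);

    /* First pass: required byte count, NOT including the terminator. */
    size_t len = wcstombs(NULL, src, 0);
    if (len == (size_t)-1)
        return NULL;              /* a character cannot be represented */

    char *dst = malloc(len + 1);
    if (!dst)
        return NULL;

    /* Second pass: len + 1 leaves room for the terminator to be written. */
    if (wcstombs(dst, src, len + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

Because the output encoding depends on the locale, WideCharToMultiByte() with CP_UTF8 is the more predictable choice on Windows when the target must be UTF-8.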

Your WideCharToMultiByte() code is not null-terminating dest, because the explicit length you pass in the cchWideChar parameter does not include the terminator:

cchWideChar

Size, in characters, of the string indicated by lpWideCharStr. Alternatively, this parameter can be set to -1 if the string is null-terminated. If cchWideChar is set to 0, the function fails.

If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting character string has a terminating null character, and the length returned by the function includes this character.

If this parameter is set to a positive integer, the function processes exactly the specified number of characters. If the provided size does not include a terminating null character, the resulting character string is not null-terminated, and the returned length does not include this character.

cJSON_CreateString() expects a null-terminated char* string. So you need to either:

add +1 to the allocation size (the num parameter of calloc(), or the size passed to malloc()) to account for the missing null terminator, and make sure that last byte is '\0'.

size_t wcsChars = wcslen(japanese[i].name);
size_t len = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, NULL, 0, NULL, NULL);
char* japanese_char = malloc(len + 1);
if (!japanese_char)
    exit(EXIT_FAILURE);
WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, japanese_char, len, NULL, NULL);
japanese_char[len] = '\0';
...
free(japanese_char);

add +1 to the return value of wcslen(), or set the cchWideChar parameter of WideCharToMultiByte() to -1, to include the null terminator in the output.

size_t wcsChars = wcslen(japanese[i].name) + 1;
size_t len = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, NULL, 0, NULL, NULL);
if (len == 0)
    exit(EXIT_FAILURE);
char* japanese_char = malloc(len);
if (!japanese_char)
    exit(EXIT_FAILURE);
WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, japanese_char, len, NULL, NULL);
...
free(japanese_char);

size_t len = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, -1, NULL, 0, NULL, NULL);
if (len == 0)
    exit(EXIT_FAILURE);
char* japanese_char = malloc(len);
if (!japanese_char)
    exit(EXIT_FAILURE);
WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, -1, japanese_char, len, NULL, NULL);
...
free(japanese_char);
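For readers who want to see what the conversion does under the hood, or who need it outside Windows, here is a minimal portable sketch of a UTF-16 to UTF-8 conversion that always null-terminates its output. The helper utf16_to_utf8_dup is hypothetical and not part of any library, it takes uint16_t code units rather than wchar_t (which is 16 bits wide only on Windows), and replacing unpaired surrogates with U+FFFD is an assumption of this sketch rather than a guarantee about WideCharToMultiByte().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: converts a null-terminated UTF-16 string to a
   malloc'd, null-terminated UTF-8 string. Unpaired surrogates become
   U+FFFD. The caller must free() the result. */
char *utf16_to_utf8_dup(const uint16_t *src)
{
    assert(src != NULL);

    size_t units = 0;
    while (src[units] != 0)
        units++;

    /* Worst case is 3 bytes per UTF-16 code unit, plus the terminator. */
    unsigned char *out = malloc(units * 3 + 1);
    if (!out)
        return NULL;

    size_t o = 0;
    for (size_t i = 0; i < units; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* Valid surrogate pair: combine into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;          /* unpaired surrogate */
        }

        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';                /* always terminate, as cJSON expects */
    return (char *)out;
}
```

The returned pointer can be handed to cJSON_CreateString() and then released with free(), since cJSON copies the string it is given.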

It works (also I was forgetting to add `,s8` to VS watch window to see UTF8 encoded value, but it was wrong without your fix: I had `ササササササササササササササササ6Cfp`). Once again, thanks a lot for the detailed answer. — evilmandarine, Oct 05 '19 at 22:09
