wchar_t* with UTF8 chars in MSVC

Question

I am trying to format a wchar_t* string containing non-ASCII characters into a UTF-8 char buffer using vsnprintf, and then print the buffer using printf.

Given the following code:

/*
  This code is modified version of KB sample:
  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm

  The usage of `setlocale` is required by my real-world scenario,
  but can be modified if that fixes the issue.
*/

#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>

#ifdef MSVC
#include <windows.h>
#endif

void vout(char *string, char *fmt, ...)
{
   setlocale(LC_CTYPE, "en_US.UTF-8");
   va_list arg_ptr;

   va_start(arg_ptr, fmt);
   vsnprintf(string, 100, fmt, arg_ptr);
   va_end(arg_ptr);
}

int main(void)
{
   setlocale(LC_ALL, "");
#ifdef MSVC
   SetConsoleOutputCP(65001); // with or without; no dice
#endif

   char string[100];

   wchar_t arr[] = { 0x0119, L'\0' }; // %ls needs a null-terminated wide string
   vout(string, "%ls", arr);
   printf("This string should have 'ę' (e with ogonek / tail) after colon:  %s\n", string);
   return 0;
}

Compiled with gcc v5.4 on Ubuntu 16, it produces the desired output in Bash:

gcc test.c -o test_vsn
./test_vsn
This string should have 'ę' (e with ogonek / tail) after colon:  ę

However, on Windows 10 with CL v19.10.25019 (VS 2017), I get weird output in CMD:

cl test.c /Fetest_vsn /utf-8
.\test_vsn
This string should have 'T' (e with ogonek / tail) after colon:  e

(the ę before colon becomes T and after the colon is e without ogonek)

Note that I used CL's new /utf-8 switch (introduced in VS 2015); the output is the same with or without it. According to their blog post:

There is also a /utf-8 option that is a synonym for setting “/source-charset:utf-8” and “/execution-charset:utf-8”.

(My source file is already saved as UTF-8 with a BOM, and setting the execution charset apparently does not help.)

What is the minimal set of changes to the code or compiler switches that would make the output identical to that of gcc?

On Windows, `printf()` (and the console in general) does not support UTF-8. You could use `WideCharToMultiByte()` (or equivalent) to convert UTF-16 encoded `wchar_t` data to UTF-8, but that is still no guarantee the console will display it correctly. You really should be writing Unicode data to the console using Unicode console APIs, like the Win32 `WriteConsoleW()` function, or `std::wcout` in C++. There are plenty of questions on StackOverflow on how to output Unicode data to a Windows console. Your reputation is high enough that you should have known to do some research before asking. — Remy Lebeau, Aug 02 '17 at 00:06
You can also run the PowerShell ISE and navigate to your program's directory, then run your program. — ChronoKitsune, Aug 02 '17 at 00:07
@RemyLebeau, thanks. I will give `WideCharToMultiByte()` and other Unicode console APIs a try. I did some research but got lost in product versioning (e.g. since VS2015, the inclusion of vsnprintf OOTB etc.). Will read some more. :) — vulcan raven, Aug 02 '17 at 00:19
@ChronoKitsune, the same executable outputs `This string should have 'ÄT' (e with ogonek / tail) after colon: e` in PS. — vulcan raven, Aug 02 '17 at 00:21
@vulcanraven: If you write to the console using Unicode APIs, you *don't* need to use `WideCharToMultiByte()` to convert your Unicode data to another encoding. Unicode APIs on Windows take `wchar_t` data as input. — Remy Lebeau, Aug 02 '17 at 00:21
@vulcanraven: this is the kind of situation where you should wrap your logging code in a custom function that takes Unicode strings as input and then writes them to the console according to the needs of the underlying platform - as UTF-8 on Ubuntu, as UTF-16 on Windows, etc. — Remy Lebeau, Aug 02 '17 at 00:28
If the locale setting appears wrong the first thing to check is the return value of `setlocale(LC_CTYPE, "en_US.UTF-8");` Was that `NULL`? "If the selection cannot be honored, the setlocale function returns a null pointer and the program’s locale is not changed." — chux - Reinstate Monica, Aug 02 '17 at 02:32
@RemyLebeau The lack of support with "On Windows, printf() (...) does not support UTF-8." is a compiler issue, not an OS issue. — chux - Reinstate Monica, Aug 02 '17 at 02:34
@chux: `printf` is a C runtime function, not a compiler function, but either way it ultimately calls platform APIs for the actual output, and the console APIs on Windows simply do not support UTF-8 — Remy Lebeau, Aug 02 '17 at 02:36
@RemyLebeau Hmm, `int main() { printf("%p\n", setlocale(LC_CTYPE, "en_US.UTF-8")); putwchar(0x0119); puts(""); }` prints `0x1801fca20 ę` on my windows machine `cmd.exe` console. Certainly looks like UTF-8 support. — chux - Reinstate Monica, Aug 02 '17 at 03:10
@chux: that does not prove UTF-8 is being used. `putwchar()` takes a `wchar_t` as input and is simply required to write it to the console, it very well could be using Microsoft's `wchar_t` based console API, which would be natural on Windows, no conversion to UTF-8 needed. — Remy Lebeau, Aug 02 '17 at 03:16
@RemyLebeau I redirected the program's output to a file with `foo > t` and then dumped the file's 15-byte contents: `30 78 31 38 30 31 66 63 61 32 30 0a c4 99 0a`. The `c4 99` is certainly [UTF-8](http://www.fileformat.info/info/unicode/char/0119/index.htm) for LATIN SMALL LETTER E WITH OGONEK. — chux - Reinstate Monica, Aug 02 '17 at 04:11
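
For reference, the `c4 99` pair in the dump above can be checked by hand: U+0119 falls in the two-byte UTF-8 range (U+0080 to U+07FF), so its upper bits go into a `110xxxxx` lead byte and its low six bits into a `10xxxxxx` continuation byte. A minimal sketch of that arithmetic, assuming a hypothetical helper named `utf8_encode` that is not part of the thread, the C runtime, or the Win32 API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): encode one Unicode code point
   as UTF-8. Writes up to 4 bytes into out and returns the number of bytes
   written, or 0 for surrogate halves and values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not scalar values */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x0119, buf)` returns 2 and leaves `c4 99` in `buf`, which matches the bytes in the redirected file. This only checks the encoding; it says nothing about which API the runtime used to produce those bytes.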

score 0 · Answer 1 · answered Aug 02 '17 at 12:47

Based on @RemyLebeau's comment, I modified the code to use the wide-character (w) variants of the printf APIs. With MSVC on Windows, the output is now identical to that of gcc on Unix.

Additionally, instead of changing the console code page, I now use _setmode to switch the file translation mode of stdout to _O_U16TEXT.

/*
  This code is modified version of KB sample:
  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm

  The usage of `setlocale` is required by my real-world scenario,
  but can be modified if that fixes the issue.
*/

#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>

#ifdef _WIN32
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT
#endif

void vout(wchar_t *string, wchar_t *fmt, ...)
{
   setlocale(LC_CTYPE, "en_US.UTF-8");
   va_list arg_ptr;

   va_start(arg_ptr, fmt);
   vswprintf(string, 100, fmt, arg_ptr);
   va_end(arg_ptr);
}

int main(void)
{
   setlocale(LC_ALL, "");
#ifdef _WIN32
   int oldmode = _setmode(_fileno(stdout), _O_U16TEXT);
#endif

   wchar_t string[100];

   wchar_t arr[] = { 0x0119, L'\0' };
   vout(string, L"%ls", arr);
   wprintf(L"This string should have 'ę' (e with ogonek / tail) after colon:  %ls\r\n", string);

#ifdef _WIN32
   _setmode(_fileno(stdout), oldmode);
#endif
   return 0;
}

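Alternatively, we can use fwprintf and pass stdout as the first argument. To write to stderr the same way with fwprintf(stderr, format, args), we would need to call _setmode on stderr as well. (perror is not a drop-in replacement here: it takes a single narrow string, not a format and arguments.)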
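
The stderr case could be handled along these lines. This is only a sketch: `enable_wide_output` and `report_error` are hypothetical names introduced here, not part of the C runtime, and the non-Windows branch assumes nothing needs switching because the locale set by `setlocale` already governs the output encoding there.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

#ifdef _WIN32
#include <io.h>    /* for _setmode, _fileno */
#include <fcntl.h> /* for _O_U16TEXT */
#endif

/* Hypothetical helper: put a stream into UTF-16 text mode on Windows.
   Returns the previous translation mode, or -1 on failure.
   On other platforms it does nothing and returns 0. */
int enable_wide_output(FILE *stream)
{
    assert(stream != NULL);
#ifdef _WIN32
    return _setmode(_fileno(stream), _O_U16TEXT);
#else
    return 0;
#endif
}

/* Hypothetical helper: report an error through the wide API,
   so the call matches the mode stderr has been switched to. */
int report_error(const wchar_t *what)
{
    assert(what != NULL);
    return fwprintf(stderr, L"error: %ls\n", what);
}
```

The idea is to call `enable_wide_output(stdout)` and `enable_wide_output(stderr)` once at the start of `main`, and from then on use only the wide functions on those streams. As far as I can tell from Microsoft's `_setmode` documentation, narrow functions such as `printf` are not supported on a stream in Unicode mode, so mixing the two should be avoided.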
