
Ok, I have this:

AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");

Now, the issue is that depending on the encoding the file was saved in, only some of these work. wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters). Still, I have issues with the other three. If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:

WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);

WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.

If I save it as UTF-8 with BOM (as it should be), then WriteConsoleW somehow starts to work (???) and everything else stops (I get ? instead of each character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?

Now, let's go to UTF-16 (Unicode - Codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.

Saving in UTF-16 Big Endian mode (Unicode - Codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to file?

The conclusion is that the way strings are compiled into binary form depends on the encoding used. So what is a portable, compiler-independent way to store strings? Is there a preprocessor that would convert one string representation into another before compilation, so that I could store the file as UTF-8 and preprocess only the strings I need in UTF-16 by wrapping them in some macro?

Mogsdad
user206334
  • Does your compiler support C99's \uXXXX and \UXXXXXXXX escapes for unicode? While unreadable, it's certainly more portable since only the basic C character set is needed. – Jens Apr 06 '13 at 09:35
  • @Jens Visual C++ is not C99 compliant, only ANSI, but I can try. – user206334 Apr 06 '13 at 09:40
  • This sounds a bit confusing and I'm not 100% sure on what you're trying to do here. Do you just want to keep one string literal for both narrow and wide char strings? Or do you want to mix/match both while keeping your encoding working? – Mario Apr 06 '13 at 09:42
  • @Mario I want to make strings in the source code portable (independent of the encoding the file was saved in). If I save it as UTF-16 or UTF-8 with BOM, printf stops working; if I save it as UTF-8 without BOM, printf works but WriteConsoleW stops, because each of them expects differently encoded strings (I suppose). – user206334 Apr 06 '13 at 09:47
  • Tried to explain some of your issues below. If you're still unable to get your output right, let me know in a comment under the answer and I'll try to have a closer look. – Mario Apr 06 '13 at 09:58
  • @Jens It seems Visual C++ does not support unicode escapes – user206334 Apr 06 '13 at 10:39
  • Once a stream such as standard output is 'imbued' as a narrow stream by the call to `printf()`, you can't do wide I/O on it with `wprintf()`, or vice versa, per the C standard (1999 at any rate; the wide I/O was added in the 1995 amendment to the 1989 standard). – Jonathan Leffler Apr 06 '13 at 10:52

2 Answers


I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:

Now, the issue is that depending on the encoding the file was saved in, only some of these work.

Of course, because the encoding determines how to interpret the string literals.

wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).

I've never heard of that one, but I'm fairly sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine, German umlauts and all.

If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:

That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.

WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.

I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm fairly sure it's indeed encoded as UTF-16. I assume that, due to the missing BOM, the compiler interprets the whole thing as a narrow string and therefore converts the UTF-8 bytes incorrectly.

If I save it as UTF-8 with BOM (as it should be), then WriteConsoleW somehow starts to work (???) and everything else stops (I get ? instead of each character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?

This is exactly what I described above. Now the wide string is encoded properly, because the compiler knows the file is UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.


Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers are able to read and interpret the file properly (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).

Edit:

To use an analogy:

You're essentially saving a raw image to a file and expecting it to work properly no matter whether other programs read it as a grayscale, palettized, or full-color image. It won't (even if the differences there are smaller).

Mario
  • About stdout and wprintf: http://stackoverflow.com/questions/15827607/writeconsolew-wprintf-and-unicode Did you really get those umlauts using wprintf? I'll try your suggestions and respond. – user206334 Apr 06 '13 at 10:01
  • Yes, we're only using `wprintf()`. Although, to be honest, I noticed there can be issues if you try to mix `printf()` and `wprintf()` within the same program. – Mario Apr 06 '13 at 10:03
  • Yes, and the answer is that stdout has an orientation, either byte or wide. There is a function to change the orientation - fwide() - or you can use freopen(), and the orientation gets set after the first call to either printf or wprintf. Microsoft's stdout seems to have byte orientation only, and my wprintf outputs nothing. If you're interested, try printing my string from the previous link. – user206334 Apr 06 '13 at 10:09
  • So we're talking about two different issues here? You want to use one "fits all" encoding (or a "don't care" one?) and mix/match narrow and wide char output? – Mario Apr 06 '13 at 10:15
  • Consider two functions which require two different inputs: enterUTF16string() and enterUTF8string(), and I want to use literals with both. It seems I can't use both in the same source file. I thought prefixing with L would solve it, but it doesn't work. – user206334 Apr 06 '13 at 10:16
  • I have the answer to what happens with WriteConsoleW(L"qwertyž") without a BOM set. Firstly, it is stored as UTF-8 encoded text: 0x71 0x77 0x65 0x72 0x74 0x79 0xC5 0xBE. Later the compiler processes the L wide-char directive and simply prefixes 0x00, so we get 0x0071, ..., but it fails to convert 0xC5 0xBE into 0x017E (UTF-16). Instead it creates two UTF-16 characters, 0x00C5 and 0x00BE. – user206334 Apr 06 '13 at 10:27
  • With BOM set it does this correctly. I get 0x017E inside binary file. – user206334 Apr 06 '13 at 10:34
  • Yes, that's what I expected. As for your "two functions" problem, I think you can solve this using templates and a macro. Adding that a bit later when I've got some time. – Mario Apr 06 '13 at 16:58
  • It's still a mystery to me why printf works without a BOM but not with one. – user206334 Apr 06 '13 at 23:34
  • I found an answer in another SO post. You can take a look. – user206334 Apr 09 '13 at 09:53

The answer is here.

Quoting:

It is impossible for the compiler to intermix UTF-8 and UTF-16 strings in the compiled output! So you have to decide for one source code file:

  • either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
  • or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix);
  • 7-bit ASCII characters are not involved and can be used with or without the L prefix.

The only portable and compiler-independent way is to use the ASCII character set and escape sequences, because there is no guarantee that every compiler will accept a UTF-8 encoded file, and compilers' treatment of those multibyte sequences might vary.

user206334