In a C program I'm using wprintf to print Unicode (UTF-16) text in a Windows console. This works fine, but when the output of the program is redirected to a log file, the log file has a corrupted UTF-16 encoding. When redirection is done in a Windows Command Prompt, all line breaks are encoded as a narrow ASCII line break (0d0a). When redirection is done in PowerShell, null characters are inserted.

Is it possible to redirect the output to a proper UTF-16 log file?

Example program:

#include <stdio.h>
#include <windows.h>
#include <fcntl.h>
#include <io.h>

int main () {

  int prevmode;

  prevmode = _setmode(_fileno(stdout), _O_U16TEXT);
  fwprintf(stdout,L"one\n");
  fwprintf(stdout,L"two\n");
  fwprintf(stdout,L"three\n");
  _setmode(_fileno(stdout), prevmode);


  return 0;
}

Redirecting the output in Command Prompt. Note the 0d0a at each line break, which should be 0d00 0a00:

c:\test>.\testu16.exe > o.txt

c:\test>xxd o.txt
0000000: 6f00 6e00 6500 0d0a 0074 0077 006f 000d  o.n.e....t.w.o..
0000010: 0a00 7400 6800 7200 6500 6500 0d0a 00    ..t.h.r.e.e....

Redirecting the output in PowerShell. Note the 0000 inserted after every character:

PS C:\test> .\testu16.exe > p.txt
PS C:\test> xxd p.txt
0000000: fffe 6f00 0000 6e00 0000 6500 0000 0d00  ..o...n...e.....
0000010: 0a00 0000 7400 0000 7700 0000 6f00 0000  ....t...w...o...
0000020: 0d00 0a00 0000 7400 0000 6800 0000 7200  ......t...h...r.
0000030: 0000 6500 0000 6500 0000 0d00 0a00 0000  ..e...e.........
0000040: 0d00 0a00                                ....
  • Apparently the error in the first example is that `L"\n"` gets output as `0D 0A 00` rather than `0D 00 0A 00`. No idea what the problem in the second example is; it looks a bit like UTF-32, except for the BOM and the newlines. Hm. – Mr Lister Aug 12 '15 at 17:56
  • @MrLister if that's the case, then it would be a compiler bug. You have to check the binary to be sure. What compiler are you using, OP? – coladict Jan 20 '16 at 14:45
  • [Hans Passant](http://stackoverflow.com/users/17034/hans-passant) gave the correct answer yesterday, but his answer was removed. [Hans Passant](http://stackoverflow.com/users/17034/hans-passant), if you repost your answer, I will give you the bounty. If not I will post your answer. – Erwin Waterlander Jan 20 '16 at 20:54
  • @coladict, I'm using mingw-w64 and Visual Studio 2013. Both give the same result. – Erwin Waterlander Jan 22 '16 at 07:05
  • @MrLister, the second example is double encoded UTF-16. PowerShell modifies the stream. See my answer below. – Erwin Waterlander Jan 22 '16 at 07:06

2 Answers

I got this answer from Hans Passant. Thanks Hans.

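The wrong line breaks are a side effect of stdout buffering. When the output is redirected, the wide characters are still sitting in the stream buffer at the moment _setmode switches the file descriptor back to its previous (narrow text) mode. The buffer is only written out later, at exit, and by then every 0a byte is expanded to 0d0a as in ordinary narrow text output. We therefore need to flush the stream before we set the mode back to the original mode.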

prevmode = _setmode(_fileno(stdout), _O_U16TEXT);
fwprintf(stdout,L"one\n");
fwprintf(stdout,L"two\n");
fwprintf(stdout,L"three\n");
fflush(stdout);               /* flush stream */
_setmode(_fileno(stdout), prevmode);

Redirecting the output in Command Prompt (cmd.exe) creates a correct UTF-16 file, without BOM.

c:\test>.\testu16 > o.txt

c:\test>xxd o.txt
0000000: 6f00 6e00 6500 0d00 0a00 7400 7700 6f00  o.n.e.....t.w.o.
0000010: 0d00 0a00 7400 6800 7200 6500 6500 0d00  ....t.h.r.e.e...
0000020: 0a00                                     ..
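
The corrupted line breaks from the unflushed version can be reproduced without the CRT. This is only a minimal sketch, assuming that narrow text mode simply expands every 0a byte to 0d 0a; narrow_text_translate is a hypothetical helper written for illustration, not a CRT function.

```c
#include <stddef.h>

/* Hypothetical helper: mimics narrow text-mode output translation,
   where every LF byte (0x0A) is written as CR LF (0x0D 0x0A).
   Returns the number of bytes stored in out, or 0 if out is too
   small (or the input is empty). */
size_t narrow_text_translate(const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == 0x0A) {
            if (n + 2 > out_cap) return 0;
            out[n++] = 0x0D;   /* inserted carriage return */
            out[n++] = 0x0A;
        } else {
            if (n + 1 > out_cap) return 0;
            out[n++] = in[i];  /* all other bytes pass through */
        }
    }
    return n;
}
```

Feeding it the UTF-16LE bytes of L"one\n" (6f 00 6e 00 65 00 0a 00) gives 6f 00 6e 00 65 00 0d 0a 00, the same odd-length sequence seen in the first Command Prompt dump. With the flush in place, the CRLF is inserted while the descriptor is still in _O_U16TEXT mode, so the break is written as 0d00 0a00.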

In PowerShell the output is still wrong.

PS C:\test> .\testu16 > p.txt
PS C:\test> xxd p.txt
0000000: fffe 6f00 0000 6e00 0000 6500 0000 0d00  ..o...n...e.....
0000010: 0a00 0000 0d00 0a00 0000 7400 0000 7700  ..........t...w.
0000020: 0000 6f00 0000 0d00 0a00 0000 0d00 0a00  ..o.............
0000030: 0000 7400 0000 6800 0000 7200 0000 6500  ..t...h...r...e.
0000040: 0000 6500 0000 0d00 0a00 0000 0d00 0a00  ..e.............
0000050: 0000 0d00 0a00                           ......

This is because PowerShell doesn't pass the stream through untouched. It tries to interpret the bytes and convert them to UTF-16. It guessed that the input stream encoding was ANSI, so it treated every byte, including the zero bytes, as a separate character. PowerShell added a UTF-16 BOM and the rest is double-encoded UTF-16. This explains the extra zeros.
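
The double encoding can be modelled in a few lines. This is a rough sketch, not PowerShell's actual conversion: widen_bytes_to_utf16le is a hypothetical helper, it assumes every input byte maps to the code unit with the same value (true for bytes below 0x80, not necessarily for others), and it ignores the way PowerShell splits the text into lines and rewrites the line terminators.

```c
#include <stddef.h>

/* Hypothetical helper: treats each input byte as one single-byte
   character and widens it to a 16-bit little-endian code unit.
   Returns the number of bytes stored in out, or 0 if out is too
   small (or the input is empty). */
size_t widen_bytes_to_utf16le(const unsigned char *in, size_t in_len,
                              unsigned char *out, size_t out_cap)
{
    if (out_cap / 2 < in_len) return 0;
    for (size_t i = 0; i < in_len; i++) {
        out[2 * i]     = in[i];  /* low byte: the original byte */
        out[2 * i + 1] = 0x00;   /* high byte: always zero here */
    }
    return 2 * in_len;
}
```

Applied to the UTF-16LE bytes of "one" (6f 00 6e 00 65 00), it yields 6f 00 00 00 6e 00 00 00 65 00 00 00, which matches the 6f00 0000 6e00 0000 6500 0000 pattern in the dump above.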

Even piping to Out-File and specifying the encoding doesn't help.

PS C:\test> .\testu16.exe | out-file p.txt -encoding unicode
PS C:\test> xxd p.txt
0000000: fffe 6f00 0000 6e00 0000 6500 0000 0d00  ..o...n...e.....
0000010: 0a00 0000 0d00 0a00 0000 7400 0000 7700  ..........t...w.
0000020: 0000 6f00 0000 0d00 0a00 0000 0d00 0a00  ..o.............
0000030: 0000 7400 0000 6800 0000 7200 0000 6500  ..t...h...r...e.
0000040: 0000 6500 0000 0d00 0a00 0000 0d00 0a00  ..e.............
0000050: 0000 0d00 0a00                           ......

PowerShell needs to be informed about the encoding, which is done by first printing a UTF-16 BOM:

prevmode = _setmode(_fileno(stdout), _O_U16TEXT);
fwprintf(stdout, L"\xfeff");  /* UTF-16LE BOM */
fwprintf(stdout,L"one\n");
fwprintf(stdout,L"two\n");
fwprintf(stdout,L"three\n");
fflush(stdout);               /* flush stream */
_setmode(_fileno(stdout), prevmode);

Now we get a correct UTF-16 file.

PS C:\test> .\testu16 > p.txt
PS C:\test> xxd p.txt
0000000: fffe 6f00 6e00 6500 0d00 0a00 7400 7700  ..o.n.e.....t.w.
0000010: 6f00 0d00 0a00 7400 6800 7200 6500 6500  o.....t.h.r.e.e.
0000020: 0d00 0a00
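
To check a redirected log without reading hex dumps, a small heuristic test can be run over the file contents. This is a sketch under stated assumptions: looks_like_utf16le_crlf is a hypothetical helper, and it assumes a healthy log starts with the ff fe BOM, has an even length, contains no U+0000 code units, and breaks lines only with 0d00 0a00.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf looks like UTF-16LE text
   with a BOM, no NUL code units and CRLF-only line breaks, else 0. */
int looks_like_utf16le_crlf(const unsigned char *buf, size_t len)
{
    if (len < 2 || len % 2 != 0) return 0;          /* whole code units only */
    if (buf[0] != 0xFF || buf[1] != 0xFE) return 0; /* UTF-16LE BOM */

    for (size_t i = 2; i + 1 < len; i += 2) {
        /* A NUL code unit is the signature of double encoding. */
        if (buf[i] == 0x00 && buf[i + 1] == 0x00) return 0;

        /* Every LF must be preceded by a CR code unit. */
        if (buf[i] == 0x0A && buf[i + 1] == 0x00) {
            if (i < 4 || buf[i - 2] != 0x0D || buf[i - 1] != 0x00) return 0;
        }
    }
    return 1;
}
```

With these assumptions, the final PowerShell dump passes, the double-encoded dumps fail on the first 0000, and the original unflushed Command Prompt output fails on the missing BOM and the odd length. The flushed Command Prompt output also fails, but only because it has no BOM.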
Community

">" will always redirect your console UTF-16 output as printable "ASCII", even if you put a BOM on your output or use prevmode = _setmode(_fileno(stdout), _O_BINARY);. I have the same problem on Windows 7; there is no way to do this with fwprintf.

gj13