7

I have a program that prints UTF-8 string to the console:

#include <stdio.h>

int main()
{
    printf("Мир Peace Ειρήνη\n");
    return 0;   
}

I configure the console to use a TrueType font (Lucida Console), set the UTF-8 code page (chcp 65001), and compile this program with both MinGW GCC and Visual Studio 2010. It works perfectly; I see the output:

Мир Peace Ειρήνη

I do the same using std::cout

#include <iostream>

int main()
{
    std::cout << "Мир Peace Ειρήνη\n" ;
    return 0;   
}

This works perfectly fine with MinGW GCC, as above, but with Visual Studio 2010 I get squares; more than that, I get two squares per non-ASCII letter.

If I run the program with redirection (test > test.txt) I get perfect UTF-8 output in the file.

Both tests done on Windows 7.

Questions:

  1. What is the difference between printf and std::cout in the Visual Studio standard library in the handling of the output stream? Clearly one of them works and the other does not.
  2. How can this be fixed?

Real Answer:

In short: you are screwed. std::cout does not really work with MSVC + UTF-8, or at least it requires enormous effort to make it behave reasonably.

At length: read the two articles referenced in the answer.

Artyom
  • It's not safe AFAIK to embed the unicode directly in your source code. I believe the safest way is to use some sort of resource or to input unicode code points with \u and the u8 literal (c++11) – Robert Mason Apr 29 '12 at 15:06
  • printf() that outputs unicode and std::cout are also a matter of [Unicode problems in C++ but not C](http://stackoverflow.com/questions/21370710) – Salvador Apr 21 '14 at 21:01

1 Answer

1

You have a number of flawed assumptions, lemme correct those first:

  • That things appear to work with g++ does not mean that g++ works correctly.

  • Visual Studio is not a compiler, it's an IDE that supports many languages and compilers.

  • The conclusion that Visual C++'s standard library needs to be fixed is correct, but the reasoning leading to that conclusion is wrong. The g++ standard library also needs to be fixed, not to mention the g++ compiler itself.

Now, Visual C++ has Windows ANSI, the encoding specified by the GetACP API function, as its undocumented C++ execution character set. Even if your source code is UTF-8 with BOM, narrow strings end up translated to Windows ANSI. If that, on your computer at the time of compilation, is a code page that includes all the non-ASCII characters, then OK, but otherwise the narrow strings will get garbled. The description of your test results is therefore seriously incomplete without mentioning the source code encoding and what your Windows ANSI codepage is.

But anyway, "If I run the program with redirection test >test.txt I get perfect UTF-8 output in the file" indicates that what you're up against is a bit of C++ level help from the Visual C++ runtime, where it bypasses the stream output and uses direct console output in order to get correct characters displayed in the console window.

This help results in garbage when its assumptions, such as Windows ANSI encoded narrow string literals, don't hold.

It also means that the effect mysteriously disappears when you redirect the stream. The runtime library then detects that the stream goes to a file, and turns off the direct console output feature. You're not guaranteed to then get the raw original byte values, but evidently you did, which was bad luck because it masked the problem.

By the way, codepage 65001 in the console in Windows is not usable in practice. Many programs just crash. Including e.g. more.


One way to get correct output is to use the Windows API level directly, with direct console output.
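A minimal sketch of that approach, assuming a Windows build: hand UTF-16 text straight to the console with WriteConsoleW, bypassing the narrow stream and all codepage conversion. Note that WriteConsoleW only succeeds when the handle really is a console, so a program that may be redirected needs a separate code path for files.

```cpp
// Windows-only sketch: direct console output of UTF-16 text.
#include <windows.h>

int main()
{
    const wchar_t text[] = L"Мир Peace Ειρήνη\n";
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    // WriteConsoleW takes UTF-16 code units directly, so no console
    // codepage conversion is involved.  Length excludes the terminator.
    WriteConsoleW(out, text, sizeof text / sizeof *text - 1, &written, nullptr);
    return 0;
}
```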

Getting correct output with the C++ streams is much more complicated.

It's so complicated that there's no room to describe it (correctly!) here, so I have to instead refer you to my 2-part blog article series about it: part 1 and part 2.

Cheers and hth. - Alf
  • Makes sense. But how does this explain the OP's problem that the program outputs _squares_? I would expect the console representation of the UTF-8 bytes: for the Russian М (U+041C) that would be \xD0 \x9C. – Mr Lister Apr 29 '12 at 17:10
  • The string is UTF-8 (checked, really). I know the entire MSVC/UTF-8 issue (crap) and I know how to handle it correctly: if the original source is UTF-8 without BOM then char * gets correct UTF-8; of course L"שלום" is messed up, but that is a different story. I can do the same with "\xXY" literals; the result is the same. About assumptions: the basic assumption is that `std::cout << str;` should behave the same as `puts(str)`. This is the assumption, and gcc did this right, or at least predictably. Now I clearly understand that `std::cout` uses some console API, which makes the problem even more severe (TBC...) – Artyom Apr 29 '12 at 19:04
  • 2
    because it is not something really expected. Finally I found your article (part 2), this Kaplan article http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx and this bug report http://connect.microsoft.com/VisualStudio/feedback/details/431244/std-ostream-fails-to-write-utf-8-encoded-string-to-console . In the end the only reasonable "solution" is to create my own stream buffer. This is yet another example of the totally crappy Windows Unicode model, where half of the applications do not handle Unicode well. – Artyom Apr 29 '12 at 19:08
  • @Artyom The issue isn't so much that cout is using a special console API. puts() works and cout doesn't work because puts passes the whole sequence of bytes to the console at once. The console takes that data, consults the ConsoleOutputCP to find out what encoding to use to convert it to UTF-16, and continues on its merry way. On the other hand cout has to pass the bytes one at a time. The console takes each byte and tries to convert it from UTF-8 to UTF-16. This fails and results in U+FFFD being displayed on the console. In short the issue is simply that the console model on Windows is dumb. – bames53 May 02 '12 at 00:06