
Hopefully a simple question: cout seems to die when handling strings that end with a multibyte UTF-8 character. Am I doing something wrong? This is with GCC (MinGW) on Windows 7 x64.

Edit: Sorry if I wasn't clear enough. I'm not concerned about the missing glyphs or how the bytes are interpreted, only that nothing at all is shown right after the call to cout << s4 (the final BAR is missing). Any further cout calls after that one display no text whatsoever!

#include <cstdio>
#include <iostream>
#include <string>

int main() {
    std::string s1("abc");
    std::string s2("…");  // … = 0xE2 80 A6
    std::string s3("…abc");
    std::string s4("abc…");

    //In C
    fwrite(s1.c_str(), s1.size(), 1, stdout);
    printf(" FOO ");
    fwrite(s2.c_str(), s2.size(), 1, stdout);
    printf(" BAR ");
    fwrite(s3.c_str(), s3.size(), 1, stdout);
    printf(" FOO ");
    fwrite(s4.c_str(), s4.size(), 1, stdout);
    printf(" BAR\n\n"); 

    //C++
    std::cout << s1 << " FOO " << s2 << " BAR " << s3 << " FOO " << s4 << " BAR ";
}

// results:

// abc FOO ��� BAR ���abc FOO abc… BAR

// abc FOO ��� BAR ���abc FOO abc…
user657267
  • Where are you running your program? The Windows command prompt really doesn't like Unicode much, so while your program might output text just fine, the console doesn't know what to do with it. – jalf Aug 05 '11 at 09:08
  • @jalf: The Windows console subsystem doesn't have real issues. `WriteConsoleW` works reasonably well given correct fonts. Windows doesn't like UTF-8, though, which means that `WriteConsoleA` is going to choke here. – MSalters Aug 05 '11 at 09:10
  • Works for me on Ubuntu/gnome-terminal/GCC. I suspect getting this right requires both C++ correctness *and* taking platform specifics into account. – John Bartholomew Aug 05 '11 at 09:11
  • @MSalters: Oh true, I should've been more specific. – jalf Aug 05 '11 at 09:17
  • Pipe the output into a file and open that file in notepad. What happens? – Kerrek SB Aug 05 '11 at 09:22
  • Calling SetConsoleCP(65001) is required to switch the console to utf8. Finding a fixed pitch font that is capable of displaying Unicode glyphs is going to be the hard problem. – Hans Passant Aug 05 '11 at 10:43
  • @Hans Passant: Lucida Console TrueType should do the trick. See http://support.microsoft.com/kb/99795 – MSalters Aug 05 '11 at 11:29
  • @MSalters - it doesn't, it has very few glyphs. Check it out with charmap.exe – Hans Passant Aug 05 '11 at 11:39
  • The next problem you're battling is that the CRT code doesn't handle a Unicode code page properly. Fixed in the next version of VS, fallback to WriteConsole(). If you get the impression you are trying to do something that isn't well supported then you're right. – Hans Passant Aug 05 '11 at 11:43
  • @MSalters: Not being able to handle UTF-8 is not a real issue??? It’s a deathblow. – tchrist Aug 05 '11 at 20:46

4 Answers


If you want your program to use your current locale, call setlocale(LC_ALL, "") as the first thing in your program. Otherwise the program's locale is C and what it will do to non-ASCII characters is not knowable by us mere humans.

n. m. could be an AI
  • +1 to n.m. On Windows, calling `setlocale(LC_ALL, "")` and doing `chcp 65001` was the trick for Unicode in the console – Matthew Jan 31 '13 at 00:10

This is really no surprise. Unless your terminal is set to UTF-8 encoding, how does it know that s2 isn't supposed to be "(Latin small letter a with circumflex)(Euro sign)(Broken bar)", supposing that your terminal is set to a Latin-1-style code page such as Windows-1252 (see http://www.ascii-code.com/)?

By the way, cout is not "dying" as it clearly continues to produce output after your test string.

koan
  • Good point. `std::cout` only echoes a stream of bytes to the outside world. How they are interpreted is between you and the program which will ultimately read those bytes. – Alexandre C. Aug 05 '11 at 10:26
  • @user657267 - Yep, cout outputs nothing, but if you use printf then you get what you expect (provided you have set the correct console font and run `chcp 65001`). – Peter Nimmo May 16 '13 at 18:08

The Windows console does not handle non-local-codepage characters by default.

You'll need to make sure you have a Unicode-capable font set in the console window, and that the codepage is set to UTF-8 through a call to chcp. This is not a guaranteed success though. Note that `wcout` changes nothing if the console can't show the fancy characters because its font is botched.

On all modern Linux distros, the console is set to UTF-8 and this should work out of the box.

rubenvb

As others have pointed out, std::cout is agnostic about this, at least in "C" locale (the default). On the other hand, your console window must be set up to display UTF-8: code page 65001. Try invoking chcp 65001 before executing your program. (This has worked for me in the past.)

James Kanze