
I'm trying to write Unicode strings to the screen in C++ on Windows. I changed my console font to Lucida Console and set the output code page to CP_UTF8 (65001).

I run the following code:

#include <stdio.h>  //notice this header file..
#include <windows.h>
#include <iostream>

int main()
{
    SetConsoleOutputCP(CP_UTF8);
    const char text[] = "Россия";
    printf("%s\n", text);
}

It prints out just fine!

However, if I do:

#include <cstdio>  //the C++ version of the header..
#include <windows.h>
#include <iostream>

int main()
{
    SetConsoleOutputCP(CP_UTF8);
    const char text[] = "Россия";
    printf("%s\n", text);
}

it prints: ������������

I have NO clue why..

Another thing is when I do:

#include <windows.h>
#include <iostream>
#include <string>
#include <cstdint>

int main()
{
    std::uint32_t oldcodepage = GetConsoleOutputCP();
    SetConsoleOutputCP(CP_UTF8);

    std::string text = u8"Россия";
    std::cout << text << "\n";

    SetConsoleOutputCP(oldcodepage);
}

I get the same output as above (non-working output).

Using printf on the std::string works fine, though:

#include <stdio.h>
#include <windows.h>
#include <iostream>
#include <string>
#include <cstdint>

int main()
{
    std::uint32_t oldcodepage = GetConsoleOutputCP();
    SetConsoleOutputCP(CP_UTF8);

    std::string text = u8"Россия";
    printf("%s\n", text.c_str());

    SetConsoleOutputCP(oldcodepage);
}

but only if I use stdio.h and NOT cstdio.

Any ideas how I can use std::cout? How can I use cstdio as well? Why does this happen? Isn't cstdio just the C++ version of stdio.h?

EDIT: I've just tried:

#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <cstdio>

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::wcout << L"Россия" << std::endl;
}

and yes, it works, but only if I use std::wcout and wide strings. I would really like to avoid wide strings, and the only solution I see so far is the C printf :l
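
The only other route I can think of is writing a small conversion wrapper myself. Here is a rough sketch of what I mean (print_utf8 is just a made-up helper name; this assumes the source file is saved as UTF-8 and that output goes to a real console, not a redirected stream):

#include <windows.h>
#include <string>

// Convert the UTF-8 bytes to UTF-16 and hand them straight to the console.
void print_utf8(const std::string& utf8)
{
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], wlen);

    DWORD written = 0;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide.c_str(),
                  static_cast<DWORD>(wide.size() - 1), &written, nullptr);
}

int main()
{
    print_utf8(u8"Россия\n");
}

That keeps my strings as plain UTF-8 chars, but the wide characters are still there under the hood, which is exactly what I was hoping to avoid.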

So the question still stands..

Brandon
  • What if you do `std::printf` when you're including `cstdio`? – Joseph Mansfield Jan 26 '14 at 23:35
  • It prints the same bad characters. No difference with or without the `std::`. I'm using MinGW 4.8.1, the very latest build. – Brandon Jan 26 '14 at 23:36
  • What does `od` say it's outputting? – Ignacio Vazquez-Abrams Jan 26 '14 at 23:48
  • I tried this experiment once in VS2010. The result: Don't use UTF8. IIRC the main problem was the buffer of the streams, i.e. `cout` would pass one char at a time to the console, which then can't render multi-unit code points correctly. – dyp Jan 26 '14 at 23:49
  • You may try `wcout`; it supports Unicode characters. – ichramm Jan 27 '14 at 00:03
  • The u8 in your "string" examples does *not* mean "string." In fact, it means "UTF8-encoded string literal." You should be able to use u8 in the C++ printed example to get the correct output. – Max Lybbert Jan 27 '14 at 00:19
  • None of it works except using C's `printf` and the `_setmode` – Brandon Jan 27 '14 at 00:35
  • @dyp: I bet that's the answer – Mooing Duck Jan 27 '14 at 00:59
  • @dyp isn't there some way to set the buffering? – Mark Ransom Jan 27 '14 at 01:13
  • @MarkRansom Yes, you can replace the buffer of cout with your own. I tried that, too, and it "works", but the CRT itself doesn't support it completely (\*) and the console doesn't either IIRC. (\*) MSDN says [`setlocale`](http://msdn.microsoft.com/en-us/library/x99tb11d.aspx) doesn't support locales with more than two bytes per character. – dyp Jan 27 '14 at 10:16
  • What dyp said - code page 65001 is broken to the point where it is typically unusable. Multibyte encodings are only correctly supported in the MS CRT for the ANSI code pages like 932 and 936 that Windows defaults to in certain East Asian locales. Dealing with strings in UTF-8 format may be a sensible thing to do internally, but on Windows it is still a second-class citizen which doesn't work right with any of the standard byte-oriented C stdlib interfaces. You are usually better off with a layer to convert Win32 wide APIs to UTF-8, sadly. – bobince Jan 27 '14 at 14:19

3 Answers

2

Although you've set your console to expect UTF-8 output, I suspect that your compiler is treating string literals as being in some other character set. I don't know why the C compiler acts differently.

The good news is that C++11 includes some support for UTF-8, and that Microsoft has implemented the relevant portions of the Standard. The code is a little hairy, but you'll want to look into std::wstring_convert (converts to and from UTF-8) and the <cuchar> header.

You can use those functions to convert to UTF-8, and assuming your console is expecting UTF-8, things should work correctly.

Personally, when I need to debug something like this, I often direct the output to a text file. Text editors seem to handle Unicode better than the Windows console. In my case, I often output the code points correctly, but have the console set up incorrectly so that I still end up printing garbage.
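
For example, a quick way to check what bytes are actually being produced (the file name here is just an example) is to dump them and open the result in a UTF-8-aware editor or hex viewer:

#include <fstream>

int main()
{
    // Write the raw bytes untouched; inspect utf8_dump.txt afterwards.
    std::ofstream out("utf8_dump.txt", std::ios::binary);
    out << u8"Россия";
}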


I can tell you that this worked for me in both Linux (using Clang) and Windows (using GCC 4.7.3 and Clang 3.5; you need to add "-std=c++11" to the command line to compile with GCC or Clang):

#include <cstdio>

int main()
{
    const char text[] = u8"Россия";
    std::printf("%s\n", text);
}

Using Visual C++ (2012, but I believe it would also work with 2010), I had to use:

#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    auto text = converter.to_bytes(L"Россия");
    std::printf("%s\n", text.c_str());
}
Max Lybbert
1

If your file is encoded as UTF-8, you'll find the string length is 12. Run strlen from <string.h> (<cstring>) on it to see what I mean. Setting the output code page will print the bytes exactly as you see them.
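
For example, something like this (assuming the source file itself is saved as UTF-8) should print 12:

#include <cstdio>
#include <cstring>

int main()
{
    const char text[] = "Россия";  // six Cyrillic letters, two UTF-8 bytes each
    std::printf("%u\n", static_cast<unsigned>(std::strlen(text)));  // prints 12
}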

What the compiler sees is equivalent to the following:

const char text[] = "\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd1\x8f";

Wrap it in a wide string (wchar_t in particular), and things aren't so nice.

Why does C++ handle it differently? I haven't the slightest clue, except perhaps the mechanism used by the code underlying the C++ version is somewhat ignorant (e.g. std::cout happily outputs whatever you want blindly). Whatever the cause, apparently sticking to C is safest...which is actually unexpected to me considering the fact that Microsoft's own C compiler can't even compile C99 code.

In any case, I'd advise against outputting to the Windows console if possible, Unicode or not. Files are so much more reliable, not to mention less of a hassle.

-2

It's more surprising that the C implementation works here than that the C++ one doesn't. A char can hold only one byte (numerical values 0-255), and thus the console should show only ASCII characters.

C must be doing some magic for you here - in fact it guesses that the bytes outside the ASCII range (which is 0-127) you're providing form a Unicode (probably UTF-8) multi-byte character. C++ just displays each byte of your const char[] array, and since UTF-8 bytes treated separately don't have distinct glyphs in your font, it puts these �. Note that you assign 6 letters and get 12 question marks.

You can read about UTF-8 and ASCII encoding if you want, but the point is that std::wstring and std::wcout are really the best solution designed to handle larger-than-byte characters.

(If you're not using Latin characters at all, you don't even save memory when you use char-based solutions such as const char[] and std::string instead of std::wstring. All these Cyrillic codes have to take some space anyway).
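
Something along these lines is what I have in mind (only a sketch; judging by the question's edit, on Windows it also seems to need the _setmode call to actually display anything):

#include <io.h>
#include <fcntl.h>
#include <cstdio>
#include <iostream>
#include <string>

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT);  // as in the question's edit; without it wcout printed nothing
    std::wstring text = L"Россия";
    std::wcout << text << L"\n";
}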

  • If I use `std::wstring` and `std::wcout`, it prints nothing.. nothing at all. In fact, that was the first thing I tried. I was also surprised that the C code worked but not the C++ code. I tried everything for gcc/g++ including `setlocale(LC_ALL, "Russian")` and `system("chcp 65001 > 0");`. Everything. The only solutions that worked were the C one and the `_setmode` one, and those are in the OP. Nothing else works/worked. Not even C++'s `printf`. – Brandon Jan 27 '14 at 00:37
  • That is only if I use `_setmode`. Using `std::wcout` with `SetConsoleOutputCP` does not work. – Brandon Jan 27 '14 at 00:47
  • Seems correct to me. This way you declare that you'll be using UTF8. – szym_rutkowski Jan 27 '14 at 00:52
  • Are you sure? I was positive that UTF8 did not require `std::wstring` or `wide-chars` or `L` prefixes. After all, the `printf` did not require that. – Brandon Jan 27 '14 at 00:54
  • I'm not 100% sure. `printf` works with `const char*`s (and nothing more complex) anyway. C++ might be forcing good conventions on you (indeed unusual for that language), such as not using one-byte types to store multi-byte characters :) – szym_rutkowski Jan 27 '14 at 01:10
  • Uhm what. UTF-8 is using bytes; for storing UTF-8, you DO NOT WANT, I repeat: **_YOU DO NOT WANT_** `wchar`s. `wchar`s are for UTF-16 on Windows and UTF-32 on *nixes; they are NOT, I repeat: they are NOT for UTF-8. Also: as long as you are not interacting with WinAPI directly, don't you dare using wide characters and wide strings. Never ever. – Griwes Jan 27 '14 at 07:48
  • @Griwes Can you explain why, technically, `wchar`s are not for UTF8? UTF8 can take only one byte for ASCII chars, but if the scale isn't that big, why should he bother? (OK, I'm not sure how `wchar` behaves when writing to file.) You provide bolded imperatives without explanation. – szym_rutkowski Jan 27 '14 at 10:38
  • @szym_rutkowski, because code units in UTF-8 take exactly 8 bits. 8 bits is a byte. If you use wide chars (16 or 32 bits!) for a UTF-8 string, each of those 8 significant bits will occupy more space (16 or 32 bits) than is actually needed. This is the exact reason why UTF-8 is superior in most aspects - you can easily keep it in poor old `char`s (minus the fact that you cannot perform string algorithms that are unaware of UTF-8 on it; but it's not like that isn't a universal problem). – Griwes Jan 27 '14 at 11:11
  • @szym_rutkowski, another reason: you should (mostly) trust people on the committee, because they (usually) know what they are doing. Modulo the fact that `u8` should use some new type, not the old ones - `u8` gives you a string of `char`s, NOT wide chars. Let's reiterate: never use wide chars for UTF-8. Never ever. If you need further explanation, consider reading about UTF-8 and about using it. – Griwes Jan 27 '14 at 11:13
  • Also: "C++ just displays each byte of your const char[] array, and since UTF bytes treated separately don't have distinct glyphs in your font" ugh no. C++ doesn't display anything. `std::cout` does not display anything. It just sends raw bytes to the system, and the system displays it. C++ doesn't care what is in those bytes. It's the system that takes the raw string and attempts to interpret it in sane way. And that sane way should involve decoding UTF-8 sequences into Unicode code points. – Griwes Jan 27 '14 at 11:21
  • My statement about "C++ displaying something" was just to make the thing simpler; I'm aware of what you said. Also: I thought about using `wchar`s in a way that each `wchar` stores one UTF8 character, regardless of its actual byte-length. Using `wchar` in the way you're describing would be obviously foolish. Maybe I'll check with gdb how it really behaves. – szym_rutkowski Jan 27 '14 at 16:12
  • Each `wchar_t` unit storing a single character doesn't necessarily work on Windows, UTF-8 especially (`sizeof(wchar_t)` is 2 on Windows while `strlen("\xe7\x89\x8b")` is 3, and that code point U+724B doesn't even need surrogate pairs in UTF-16, which require two `wchar_t` units to represent one code point). –  Jan 27 '14 at 19:33