
I have a simple piece of code that opens a file stream and prints out its contents. As soon as it hits a non-ASCII (Unicode) character, it appears to stop reading.

My system is set to the Japanese locale and Visual Studio is set to compile as Unicode. I'm not sure what's going on.

File:

<abc \ 单位孤>hajslklfasjflkesjfleajflj

File Hex Dump:

EF BB BF 3C 61 62 63 20 5C 20 E5 8D 95 E4 BD 8D
E5 AD A4 3E 68 61 6A 73 6C 6B 6C 66 61 73 6A 66
6C 6B 65 73 6A 66 6C 65 61 6A 66 6C 6A 0D 0A

Code Part:

std::wifstream fin(path, std::ios::binary);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10ffff, std::consume_header>));
if (!fin.good()) return;

while (fin.good()) {
    std::wcout << (wchar_t)fin.get() << "\n";
}

fin.close();

Output: (console screenshot, not reproduced here)

Nyaarium

2 Answers


It's reading fine, it's just not writing.

std::wcout << (wchar_t)fin.get() << "\n";

Unfortunately, std::wcout does not reliably get Unicode to a terminal.

Although the Windows console works natively in UTF-16 code units, std::wcout is still defined in purely byte-based terms. It converts its wide input down to bytes using the locale's default encoding before writing to the Unicode-ignorant byte stream stdout (which could just as well be a byte-oriented file redirection as a Unicode-capable console, after all).

So std::wcout ends up being just as limited under Windows as all the other byte IO interfaces, restricted to characters in the current code page. Your code page is probably 932, where character U+5355 doesn't exist, so trying to write it breaks the stream.
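One way to confirm that the problem is on the output side is to print what was read as hexadecimal code points, which is pure ASCII and so survives any code page. The following is a minimal sketch, not the asker's code: decode_utf8 and to_hex_list are hypothetical helper names, the decoder assumes well-formed UTF-8 and does no validation, and it works on raw bytes rather than on a wifstream.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into code points, skipping a
// leading BOM (EF BB BF). Assumes well-formed input; no validation.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) i = 3;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Number of continuation bytes implied by the lead byte.
        const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        std::uint32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}

// Hypothetical helper: format code points as "U+XXXX" separated by spaces.
std::string to_hex_list(const std::vector<std::uint32_t>& cps) {
    std::string s;
    char buf[16];
    for (std::uint32_t cp : cps) {
        std::snprintf(buf, sizeof buf, "U+%04X", static_cast<unsigned>(cp));
        if (!s.empty()) s += ' ';
        s += buf;
    }
    return s;
}
```

Feeding it the bytes from the hex dump in the question and printing the result with plain printf or std::cout should show U+5355, U+4F4D and U+5B64 between the angle brackets, which would indicate that the file content itself is fine.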

Setting the current code page to 65001, in an attempt to get the same UTF-8 output that other modern platforms prefer, doesn't quite work because of assorted multibyte character-counting bugs in the C runtime. Microsoft has left this broken for many versions, so expect UTF-8 to remain a second-class citizen under Windows.

Some alternatives:

  1. Use the Win32 WriteConsoleW API instead of the stdlib interfaces. (This requires care to handle possible output redirection, and extra work if your project needs to be cross-platform.)

  2. Use _setmode with _O_U16TEXT to change the output stream to UTF-16-encoded bytes. See example in this question. It seems not all interfaces necessarily work in this mode; you're probably in for trouble if you try to use the byte interfaces at the same time.

  3. Output explicitly UTF-8-encoded bytes and require Windows console users to just put up with the mojibake and missing glyphs that result.
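As a rough sketch of how alternatives 1 and 3 might be combined: write through WriteConsoleW when stdout really is a console, and fall back to explicit UTF-8 bytes when it is redirected or when not running on Windows. The function names utf16_to_utf8 and write_text are made up for this illustration, and using GetConsoleMode to detect redirection is one common approach rather than the only one.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: encode UTF-16 as UTF-8.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: send text to stdout, using the console API when possible.
void write_text(const std::u16string& text) {
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(h, &mode)) {   // succeeds only for a real console
        std::fflush(stdout);          // keep ordering with earlier stdio output
        DWORD written = 0;
        WriteConsoleW(h, text.data(), static_cast<DWORD>(text.size()),
                      &written, nullptr);
        return;
    }
#endif
    // Redirected output, or not Windows: emit explicit UTF-8 bytes.
    const std::string bytes = utf16_to_utf8(text);
    std::fwrite(bytes.data(), 1, bytes.size(), stdout);
}
```

A call such as write_text(u"<abc \\ \u5355\u4F4D\u5B64>\n") would then bypass the code page conversion on a console while still producing sensible bytes in a file or pipe.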

It is a shame this story is still so miserable.

bobince

std::wcout may have something to do with it.

Try this page: https://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/

#include <cstdio>     // wprintf, _fileno, stdout
#include <fcntl.h>    // _O_U16TEXT, _O_WTEXT, _O_U8TEXT (MSVC)
#include <io.h>       // _setmode (MSVC)
#include <iostream>
#include <locale>

int main() {
    //std::locale loc2 = std::locale("zh-CN");
    //SetConsoleOutputCP(CP_UTF8);
    //SetConsoleCP(65001);
    // Try each wide-text translation mode in turn and print the same string.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << "text:" << L"<abc单位孤>hajslklfasjflkesjfleajflj" << "\n";
    _setmode(_fileno(stdout), _O_WTEXT);
    std::wcout << "text:" << L"<abc单位孤>hajslklfasjflkesjfleajflj" << "\n";
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::wcout << "text:" << L"<abc单位孤>hajslklfasjflkesjfleajflj" << "\n";
    //setlocale(LC_ALL, "C");
    //fputs("hello 2: ΓΔΕΘΛΞΠΣΦΨЪЩШЫЮЯ\n", stdout);
    std::wcout << "text:" << L"hello 2: ΓΔΕΘΛΞΠΣΦΨЪЩШЫЮЯ" << "\n";
    wprintf(L">>> hello 2: ΓΔΕΘΛΞΠΣΦΨЪЩШЫЮЯ \n");
    // Same sequence again after constructing an en-US locale object
    // (it is never imbued or made global, so it should not affect the output).
    std::locale loc3 = std::locale("en-US");
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << "text:" << L"<abc单位孤>hajslklfasjflkesjfleajflj" << "\n";
    _setmode(_fileno(stdout), _O_WTEXT);
    std::wcout << "text:" << L"<abc单位孤>hajslklfasjflkesjfleajflj" << "\n";
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::wcout << "text:" << L"<abc单位孤>hajslklfasjflkesjfleajflj" << "\n";
    //setlocale(LC_ALL, "C");
    //fputs("hello 2: ΓΔΕΘΛΞΠΣΦΨЪЩШЫЮЯ\n", stdout);
    std::wcout << "text:" << L"hello 2: ΓΔΕΘΛΞΠΣΦΨЪЩШЫЮЯ" << "\n";
    wprintf(L">>> hello 2: ΓΔΕΘΛΞΠΣΦΨЪЩШЫЮЯ \n");
    return 0;
}

Depending on which code page you select with the chcp command (for example, chcp 1252 or chcp 65001), you will get different output.
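The experiment above can be pared down to a single mode switch. The sketch below assumes the MSVC runtime for _setmode and _O_U16TEXT; enable_utf16_stdout and print_sample are hypothetical names, and on other platforms the helper simply reports that the mode is unavailable.

```cpp
#include <cstdio>
#include <iostream>
#ifdef _WIN32
#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#endif

// Hypothetical helper: switch stdout to UTF-16 text mode (MSVC runtime only).
// Returns true if the mode was changed.
bool enable_utf16_stdout() {
#ifdef _WIN32
    std::fflush(stdout);  // flush anything written in the previous mode
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    return false;         // no equivalent call outside the MSVC runtime
#endif
}

// Hypothetical helper: print the sample line from the question, using only
// wide output once the mode has been switched.
void print_sample() {
    if (enable_utf16_stdout()) {
        std::wcout << L"<abc \\ \u5355\u4F4D\u5B64>" << L'\n';
    } else {
        std::wcout << L"UTF-16 output mode is not available here" << L'\n';
    }
}
```

Once the stream is in this mode it is safest to stick to wide output (std::wcout, wprintf) for the rest of the program, since narrow output on the same stream is likely to fail.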

I wrote a Unicode test a week or two ago that might help you. See https://github.com/MagnusTiberius/wcutil/blob/master/widechartest.cpp for details.

You may also want to read this page on setting the code page so that the console renders double-byte and multi-byte characters:

http://www.curlybrace.com/words/2014/10/03/windows-console-and-doublemulti-byte-character-set/

benG