4

My code for writing text works for ANSI characters, but when I try to write Japanese characters they do not appear in the file. Do I need to use UTF-16 encoding? If so, how would I do it in code?

std::wstring filename;
std::wstring text;
filename = L"path";
wofstream myfile;
myfile.open(filename, ios::app);
getline(wcin, text);
myfile << text << endl;
wcin.get();
myfile.close();
Remy Lebeau
  • So, you're running this in a terminal? What kind? And what happens if you pipe files with different charset encodings into your program? – einpoklum Sep 21 '20 at 19:58
  • It's a console application in Visual C++ 19 and, I am sorry, I don't seem to understand your other questions. – Guilherme Galdino Sep 21 '20 at 20:00
  • @GuilhermeGaldino You may want to set a BOM? – πάντα ῥεῖ Sep 21 '20 at 20:08
  • Most Japanese characters are not representable as a single `char`. So they will be encoded as multiple bytes - that may be utf-16, utf-32, utf-8 or some other encoding - you need to know how the thing you are reading is encoded in order to be able to interpret it correctly. Internationalisation and dealing with multiple char sets and encodings is *hard*. – Jesper Juhl Sep 21 '20 at 20:10
  • @JesperJuhl: Maybe he's just running it in a `cmd` session, and only knows that he's typing in Japanese. He's expecting `wcin` to "just work" (and that's not an unreasonable expectation...) – einpoklum Sep 21 '20 at 20:12
  • Did you check in a debugger if after `getline(wcin, text)`, `text` contains correct characters? – rustyx Sep 21 '20 at 20:13
  • @πάνταῥεῖ "You may want to set a BOM" - those are mostly ignored in my experience. Besides, that only deals with the potential endianness problem, not the encoding problem. – Jesper Juhl Sep 21 '20 at 20:14
  • @einpoklum Given the current state of the C++ standards library's handling of non-ascii char sets, locales and encodings, I actually *would say* that it's an unreasonable expectation. – Jesper Juhl Sep 21 '20 at 20:17
  • @rustyx Yes, `text` contains the characters I typed, worked in English, Japanese and Russian, only the file itself doesn't contain them – Guilherme Galdino Sep 21 '20 at 20:18
  • @GuilhermeGaldino It probably does contain them. How are you viewing the file? – Mooing Duck Sep 21 '20 at 23:02

3 Answers

3

From the comments it seems your console correctly understands Unicode, and the issue is only with file output.

Here's how to write a text file in UTF-16LE. Just tested in MSVC 2019 and it works.

#include <string>
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>

int main() {
    std::wstring text = L"test тест 試験.";
    std::wofstream myfile("test.txt", std::ios::binary);
    std::locale loc(std::locale::classic(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>);
    myfile.imbue(loc);
    myfile << wchar_t(0xFEFF) /* BOM, written as FF FE in UTF-16LE */;
    myfile << text << "\n";
    myfile.close();
}

You must open the file in std::ios::binary mode under Windows. Otherwise text-mode translation expands the 0x0A byte of \n into \r\n (0x0D 0x0A), so the two-byte UTF-16 newline becomes three bytes and every character after it is misaligned.
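To see where the extra byte comes from, here is a small sketch that mimics what text mode does to the byte stream. `text_mode_translate` and `utf16le_newline` are hypothetical names made up for this illustration, not CRT functions, and the sketch assumes the translation does nothing more than insert 0x0D before each 0x0A byte.

```cpp
#include <string>

// Hypothetical helper: imitates Windows text-mode output, which inserts a
// carriage return (0x0D) before every line-feed byte (0x0A) it sees.
// It works on raw bytes and knows nothing about UTF-16 code units.
std::string text_mode_translate(const std::string& bytes) {
    std::string out;
    for (char b : bytes) {
        if (b == '\n') out.push_back('\r');
        out.push_back(b);
    }
    return out;
}

// L'\n' encoded as UTF-16LE is the two bytes 0A 00.
const std::string utf16le_newline("\x0A\x00", 2);
```

Passing `utf16le_newline` through `text_mode_translate` gives the three bytes 0D 0A 00, which is no longer a whole number of UTF-16 code units. The same thing would happen to any character whose encoding contains a 0x0A byte, for example U+010A (bytes 0A 01).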

You don't have to write the BOM at the beginning, but having one greatly simplifies opening the file using the correct encoding in text editors.

Unfortunately, std::codecvt_utf16 has been deprecated since C++17 with no replacement (yes, Unicode support in C++ is that bad).
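If you would rather not depend on a deprecated facet, one possible workaround is to do the encoding by hand and write the bytes through a narrow stream. This is only a sketch: `to_utf16le_with_bom` and `write_utf16le_file` are hypothetical helpers, not part of any library, and the code assumes the `wchar_t` values are valid Unicode (UTF-16 code units on Windows, code points where `wchar_t` is 32 bits).

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-16LE bytes, BOM first.
std::string to_utf16le_with_bom(const std::wstring& text) {
    std::string bytes("\xFF\xFE", 2);  // U+FEFF in little-endian byte order
    auto put16 = [&bytes](std::uint32_t unit) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    };
    for (wchar_t wc : text) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        assert(cp <= 0x10FFFF && "not a Unicode code point");
        if (cp > 0xFFFF) {
            // Only reachable where wchar_t is 32 bits: split into a surrogate pair.
            cp -= 0x10000;
            put16(0xD800 + (cp >> 10));
            put16(0xDC00 + (cp & 0x3FF));
        } else {
            // A 16-bit wchar_t (Windows) already holds a UTF-16 code unit.
            put16(cp);
        }
    }
    return bytes;
}

// Hypothetical helper: writes the encoded bytes in binary mode, so no
// newline translation can touch them. Returns false if the write failed.
bool write_utf16le_file(const std::string& path, const std::wstring& text) {
    std::ofstream out(path, std::ios::binary);
    const std::string bytes = to_utf16le_with_bom(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

With this, `write_utf16le_file("test.txt", text)` would take the place of the imbued `std::wofstream`. The trade-off is that you own the encoding logic yourself, including any validation of lone surrogates, which this sketch does not attempt.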

rustyx
  • The question does ask "Do I need to use UTF-16 encoding?", which to me doesn't seem like a hard commitment to UTF-16. I think the goal is just to save Unicode to a file. Generally UTF-8 without a BOM would be better. – Eryk Sun Sep 22 '20 at 06:03
  • Worked right away! I still need to study why it worked but thanks anyway! – Guilherme Galdino Sep 24 '20 at 19:04
  • A wide-character stream like `wcout` or `wofstream` takes wide characters (`wchar_t`) as input and outputs them as a sequence of bytes. Exactly how `wchar_t` are *encoded* into bytes is determined by the `locale`. By default some single-byte encoding like cp1252 is used, which can represent at most 256 code points, the rest are simply dropped during encoding. The `codecvt_utf16` facet tells the locale to encode `wchar_t` as UTF-16. Hope this clarifies a little... – rustyx Sep 24 '20 at 21:05
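Following the suggestion in the comments that UTF-8 without a BOM is often the better choice, here is a rough sketch of how the same text could be turned into UTF-8 by hand, without `<codecvt>`. `to_utf8` is a hypothetical helper written for this illustration. It assumes well-formed input and does not reject lone surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts a wide string to UTF-8 bytes.
std::string to_utf8(const std::wstring& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);
        // Where wchar_t is 16 bits (Windows), join a surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        if (cp < 0x80) {                 // 1 byte: ASCII
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {         // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {       // 3 bytes: most CJK characters
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                         // 4 bytes: beyond the BMP
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The result can be written with an ordinary `std::ofstream` opened with `std::ios::binary`, for example `out << to_utf8(text)`. Since UTF-8 has no byte-order variants, no BOM is needed.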
1

Expanding my answer to your last question, here's a C library solution for writing the file. I saved the source as UTF-8 and compiled with Microsoft "cl /EHsc /W4 /utf-8 test.cpp".

#include <cstdio>
#include <fcntl.h>
#include <io.h>
#include <string>
#include <iostream>

// From fcntl.h:
//  #define _O_U16TEXT     0x20000 // file mode is UTF16 no BOM (translated)
//  #define _O_WTEXT       0x10000 // file mode is UTF16 (translated)

using namespace std;

int main()
{
    // Declare console I/O that works with Unicode.
    _setmode(_fileno(stdout),  _O_WTEXT);  // or _O_U16TEXT, either works
    _setmode(_fileno(stdin), _O_WTEXT);

    // Send a string to the console to verify stdout works with wide strings.
    wstring s = L"こんにちは, 世界!\nHello, World!";
    wcout << s << endl;

    // Read an input string.  I used an IME to enter Chinese.
    // Verify the stdin works...
    wstring test;
    getline(wcin, test);

    // Write it back out to stdout...
    wcout << test << endl;

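    // Write it to a file as UTF-16LE (the CRT adds the BOM).
    FILE *dest = fopen("out.txt", "w, ccs=UTF-16LE");
    if (dest == nullptr)
        return 1;  // could not open the file
    fwprintf(dest, L"%s", test.c_str());  // MSVC: %s takes a wide string here
    fclose(dest);
    return 0;
}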

Output (console):

C:\>test
こんにちは, 世界!
Hello, World!
你好,马克!
你好,马克!

C:\>type out.txt
你好,马克!

Hex dump of the file content showing UTF-16LE w/ BOM encoding:

ff fe 60 4f 7d 59 0c ff 6c 9a 4b 51 01 ff
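As a cross-check of the dump, the bytes can be decoded back into code points. The following is a minimal sketch, with `decode_utf16le` being a hypothetical helper written for this purpose. It assumes an even number of bytes and skips a leading BOM if one is present.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-16LE bytes (optional BOM) into code points.
std::u32string decode_utf16le(const std::string& bytes) {
    std::u32string out;
    // Reads one little-endian 16-bit code unit starting at byte offset i.
    auto unit_at = [&bytes](std::size_t i) -> char32_t {
        return static_cast<unsigned char>(bytes[i]) |
               (static_cast<char32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8);
    };
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        char32_t cp = unit_at(i);
        if (i == 0 && cp == 0xFEFF) continue;  // skip the BOM
        // Join a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            const char32_t low = unit_at(i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;  // the low surrogate has been consumed
            }
        }
        out.push_back(cp);
    }
    return out;
}
```

Fed the 14 bytes above, this yields U+4F60 U+597D U+FF0C U+9A6C U+514B U+FF01, which are the six characters 你好,马克! that were typed in.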
Mark Tolonen
0

Sadly, handling encodings in standard C++ is not very convenient. On POSIX systems narrow (char) streams work better; on Windows wchar_t streams are more convenient.

To control the file encoding, you need to set a std::locale on the stream using imbue.

Note that Boost.Locale extends the standard locale functionality, so you may need it to make this work.

Usually it is recommended to use the system locale:

// set global locale so wcin is able to read data properly
std::locale::global(std::locale{""});

std::wstring filename;
std::wstring text;
filename = L"path";
wofstream myfile;

// if you need a specific text encoding, check which locales your system supports;
// on POSIX systems it may be something like "en_US.UTF-8"
myfile.imbue(std::locale{""});
myfile.open(filename, ios::app);
getline(wcin, text);
myfile << text << endl;
wcin.get();
Marek R