3

I saw dozens of questions about this topic but none of them helped me.

Suppose I have a string "հայեր" or "русский" (wchar_t*, wstring, LPTSTR, or something else).

Now I want to create an output file "res.txt" and write the string to it. Each attempt so far has written either nothing, '?' characters, or some arbitrary numbers to the file.

Can someone suggest a way to store non-English strings and write them to a file properly?

Thanks.

mbaros
  • How did you write the file? If you are using a `wchar_t*` or `std::wstring` you should be using a `std::wofstream`. – NathanOliver Jul 18 '16 at 12:15
  • I tried, but none of the combinations helped. So you suggest keeping it as a wstring and using wofstream? If so, it does not write anything to the file – mbaros Jul 18 '16 at 12:18
  • How is your source file encoded? UTF-8, Latin-1, ...? – Jarod42 Jul 18 '16 at 12:39

3 Answers

2

Before you can even begin the task of creating text files containing non-Latin characters, you have to determine which encoding your locale uses.

For example, if your locale uses the UTF-8 encoding, the string "русский" will have to be encoded completely differently than if your locale is KOI8-R.

The string "русский" in UTF-8 is represented by the octets (bytes): d1 80 d1 83 d1 81 d1 81 d0 ba d0 b8 d0 b9. For a KOI8-R locale, the equivalent octets are d2 d5 d3 d3 cb c9 ca.

Internationalization is hard.

In most cases you might be able to use the C++ library's wide character streams with Unicode:

#include <iostream>
#include <locale>

int main()
{
    std::locale::global(std::locale(""));
    std::wcout << L"\u0440\u0443\u0441\u0441\u043a\u0438\u0439" << std::endl;
    return 0;
}

Hopefully, the output from this will be "русский" on your platform. If it works, this might be the path of least resistance, but you will have to look up the Unicode code point of each character.

There's also support for UTF-8 in the new C++ standard, but the real answer here is for you to spend some time educating yourself on the general concepts of locales, Unicode, and internationalization. It will be difficult to do this right without a complete understanding of how all of these things work.

Sam Varshavchik
  • It does not output anything, either on the console or in the file. – mbaros Jul 18 '16 at 13:39
  • Then your C++ compiler or operating system does not support even the minimum requirements of the current C++ standard; or your current locale does not include the Cyrillic alphabet. Nothing further can be determined without knowing the compiler, and the system environment's locale. – Sam Varshavchik Jul 18 '16 at 21:49
0

This worked well for me.

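    #include <codecvt>
    #include <fstream>
    #include <locale>

    int main()
    {
        // Locale whose codecvt facet converts wchar_t to UTF-8 on output.
        // Note: <codecvt> is deprecated since C++17.
        const std::locale utf8_locale(std::locale(), new std::codecvt_utf8<wchar_t>());
        std::wofstream file("res.txt");  // file name taken from the question
        file.imbue(utf8_locale);         // imbue before writing anything
        file << L"իմբյու" << std::endl;
        return 0;
    }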
mbaros
-1

You have to use Unicode for the encoding.

Sia
  • How? Can you give a full example, please? – mbaros Jul 18 '16 at 12:19
  • You need to use escapes of the form "\uXXXX" (each X stands for a hexadecimal digit); this way you can use Unicode symbols in your code. Unfortunately, it's quite impractical when the whole text consists of these symbols. C++ has pretty bad Unicode support in general... (so if your whole program is based on Unicode, I would perhaps suggest using a different language) – Sia Jul 18 '16 at 12:27
  • Wow... that's quite a strong statement. I use C++ often, and I have zero problems with Unicode. But it boils down to how you use it and what you need. As I'm living in the Linux world, most of the strings/texts around are UTF-8 encoded, which works very well with C++. Windows went for UTF-16, which is a tiny bit trickier in C++ sources (but with a good IDE, and remembering to mark each string literal as "wide", it should be fine), and IMO it works quite OK, far from "bad use". There's also no need to work with \uXXXX explicitly if you plan well and know what you are doing. – Ped7g Jul 18 '16 at 13:11
  • @Ped7g thanks for the comment. I think the same way. But the problem is that it is very tricky and complicated to understand in C++. Below I posted a solution as an answer, but I don't understand its meaning. I believe there are other ways to do that. – mbaros Jul 18 '16 at 17:39
  • @Ped7g one more question. Do you know how to print UTF-16 to console in C++? I mean the functions and syntax. – mbaros Jul 18 '16 at 17:40
  • No, not off the top of my head, I'm living in the UTF-8 world. But with wofstream it should be straightforward. If it does not work for you, there may be something wrong with the environment you are running that code in. If you are writing a console application for cmd.exe on Windows, your cmd.exe may not be using UTF-16 but some local 8-bit encoding (like KOI8-R), and then wofstream will not work (no details/questions, I left Windows in 2006 and have no idea how the current console looks/works there). You have to set the correct encoding both in the code itself (to emit the bytes) and in the receiver (the console, to expect such bytes). – Ped7g Jul 18 '16 at 17:53
  • @mbaros But to verify your code works as expected, stop using the console, output the strings into a file, then use a hex viewer or a text editor (with the ability to set the file encoding manually) to verify that the file bytes follow the selected encoding. Once you can output data files correctly, you will know you are sending correct data to the console. Any further problems then mean that you are either using the wrong console API (one not supporting your encoding), or you didn't initialize it correctly, or your console is not set up correctly, etc... (I assume you are on MSwin, enjoy the encoding hell :P ) – Ped7g Jul 18 '16 at 17:56