I want to read this Cyrillic text from a .txt file: аааааааааааа

std::wstring str;
std::wifstream in(path);
std::getline(in, str);
in.close();

But the content of `str` is: аааааааааааа (the file is encoded as UTF-8; I inspected the string in the debugger, not in the console).

I also tried changing the file encoding to UTF-16 (LE and BE) and got ÿþ000000000000 and þÿ000000000000, respectively.

Also, I found this solution, but as you can see, it didn't help.

Boloto
  • `wstring` and `wifstream` are for `wchar_t`, which does not entail an encoding, so it is not UTF-16 or UTF-32. (It can be one of those encodings, but that's not mandated by the data type.) Since your file is UTF-8, you probably should be using `string` and `ifstream`, which are `char` based. (Like `wchar_t`, the `char` type also does not entail an encoding.) (*I did not downvote.*) – Eljay Mar 12 '21 at 14:08
  • @Eljay `string` and `ifstream` have the same problem. The only difference between `string` and `wstring` is that if I assign the Cyrillic text directly to a variable in the program, `string` shows the same garbage, whereas `wstring` stores the original text correctly. – Boloto Mar 13 '21 at 09:09

1 Answer

On Windows you have to open the file in binary mode and then apply the UTF-16 facet; otherwise the stream assumes the system's default code page. See the example below.

Note that it is common to use UTF-8 for storing data, even in Windows applications. Your Windows program expects UTF-16 for the APIs, so you can read and write the file in UTF-8 and convert back and forth to UTF-16.
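
As a rough, portable sketch of the "store UTF-8, convert to UTF-16" approach: the helper names `read_bytes` and `utf8_to_utf16` are hypothetical (they are not part of the standard library or the Windows API), and the hand-written decoder only stands in for a platform conversion routine. It assumes reasonably well-formed input and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a whole file as raw bytes, with no
// code-page or newline translation (binary mode).
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Throws std::runtime_error on truncated or malformed sequences.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    // Skip a UTF-8 byte order mark (EF BB BF) if present.
    if (in.size() >= 3 && in.compare(0, 3, "\xEF\xBB\xBF") == 0)
        i = 3;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A call such as `utf8_to_utf16(read_bytes(path))` would then yield the UTF-16 code units. On Windows, where `wchar_t` is 16 bits wide, those units can be copied into a `std::wstring` before being passed to wide APIs.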

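// Read one line from a UTF-16LE file through the codecvt_utf16 facet.
// The define silences the codecvt deprecation warnings in Visual Studio
// and must come before the includes.
#define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

std::wstring str;
std::wifstream in(path, std::ios::binary);
// The stream's locale takes ownership of the facet allocated with new.
in.imbue(std::locale(in.getloc(), 
    new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
// A leading BOM, if present, is kept as U+FEFF at the start of str.
std::getline(in, str);
in.close();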

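You can also use `pubsetbuf` to avoid `codecvt` and its warnings (this relies on Visual Studio's behavior and its 16-bit `wchar_t`):

#include <fstream>
#include <iostream>
#include <string>

std::wifstream in(path, std::ios::binary);
wchar_t wbuf[128] = { 0 };
in.rdbuf()->pubsetbuf(wbuf, 128);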

//BOM check
wchar_t bom{};
in.read(&bom, 1);
if(bom == 0xfeff)
    std::cout << "UTF16-LE\n";

//read file
std::wstring str;
std::getline(in, str);
in.close();
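
For comparison, the BOM check can also be sketched without wide streams at all, by assembling the code units from raw bytes. The helper name `utf16_bytes_to_units` is hypothetical, and treating a file without a BOM as little-endian is an assumption of this sketch. It also shows where the "ÿþ" in the question comes from: the BOM bytes FF FE are "ÿ" and "þ" when each byte is read as a separate Latin-1 character, and the low byte 0x30 of "а" (U+0430) shows up as "0".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn raw UTF-16 file bytes into code units.
// The byte order mark selects the byte order; with no BOM the data
// is assumed to be little-endian (an assumption of this sketch).
std::u16string utf16_bytes_to_units(const std::string& bytes)
{
    auto byte_at = [&](std::size_t k) {
        return static_cast<unsigned char>(bytes[k]);
    };

    bool big_endian = false;
    std::size_t i = 0;
    if (bytes.size() >= 2) {
        if (byte_at(0) == 0xFF && byte_at(1) == 0xFE) {
            i = 2;                 // UTF-16LE BOM, skip it
        } else if (byte_at(0) == 0xFE && byte_at(1) == 0xFF) {
            big_endian = true;     // UTF-16BE BOM, skip it
            i = 2;
        }
    }

    std::u16string out;
    // Combine each pair of bytes; a trailing odd byte is ignored.
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = byte_at(big_endian ? i + 1 : i);
        const unsigned hi = byte_at(big_endian ? i : i + 1);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

The input would be the file contents read in binary mode with a plain `std::ifstream`, so no code-page conversion is applied before decoding.
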
Barmak Shemirani
  • Yes, it works; I also needed to save the file as UTF-16LE. But I have one more question. While looking for a solution I had already come across `codecvt`, used a little differently, but that is beside the point now. I read that `codecvt` is deprecated and not entirely secure. Could this be a problem, and what does "deprecated" mean in the context of C++? Thanks for your help and answers! – Boloto Mar 13 '21 at 09:34
  • This code is handled differently on Linux and Windows. `wchar_t` is also different on Linux and Windows, so the code ends up being different on each platform. "Deprecated" means "this code may not be supported in the future, use the newer code instead". But in this case there is no replacement for `codecvt` yet. The best option is to use UTF-8 for storing data. Your data will be compatible with Linux and web-based data. Just use `MultiByteToWideChar(CP_UTF8, ...)` to convert UTF-8 to UTF-16. Visual Studio supports wide-char filenames for `std::fstream`, so you can write `std::fstream(L"filename.txt")`. – Barmak Shemirani Mar 13 '21 at 13:08