You are using std::string which, by default, deals with ASCII encoded text, and you are opening your file in "text translation mode". The first thing you need to do is open the file in binary mode so that it doesn't perform translation on individual char values:
in_file.open(file_name.c_str(), std::ios::binary);
or in C++11
in_file.open(file_name, std::ios::binary);
The next thing is to stop using std::string for storing the text from the file. You will need to use a string type that recognizes the character encoding you are using, together with the appropriate character type.
As it turns out, std::string is actually an alias for std::basic_string<char>. C++03 already had wchar_t, which supports "wide" characters (more than 8 bits), and C++11 introduced several new Unicode character types (char16_t and char32_t). There is a standard alias for a basic_string of wchar_t: std::wstring.
Start with the following simple test:
#include <iostream>
#include <fstream>
#include <string>

int main() {
    std::string file_name = "D:/TEI/dictionary1.txt";
    std::wifstream in_file(file_name, std::ios::binary);
    if (!in_file.is_open()) {
        // "L" prefix indicates a wide string literal
        std::wcerr << L"file open failed\n";
        return 1;
    }
    std::wstring line1;
    std::getline(in_file, line1);
    std::wcout << L"line1 = " << line1 << L"\n";
}
Note how cout etc. also become prefixed with w.
The standard ASCII character set contains 128 characters numbered 0 through 127. In ASCII, \n and \r are represented with the 7-bit values 10 and 13 respectively.
Your text file appears to be UTF-8 encoded. UTF-8 uses an 8-bit unsigned representation that allows characters to use a variable number of bytes: the value 0 requires 1 byte, the value 128 requires 2 bytes, the value 8192 requires 3 bytes, and so on.
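A byte with the highest bit (2^7) clear is a single 7-bit ASCII value. If the highest bit is set, the byte is part of a multibyte sequence: a lead byte starts with the bits 110, 1110 or 11110 (announcing a 2-, 3- or 4-byte sequence) and carries the high bits of the value, while each continuation byte that follows starts with the bits 10 and carries 6 more bits. So the byte sequence { 0xC4, 0x80 } represents the value ((0xC4 & 0x1F) << 6) | (0x80 & 0x3F), which is (4 << 6) | 0 or (wchar_t)256. The byte sequence { 0xC4, 0x8D } represents (4 << 6) | 13 or wchar_t 269.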
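As a rough illustration of how bytes combine into a value, here is a minimal decoder sketch. It assumes the standard UTF-8 layout (lead bytes of the form 110xxxxx, 1110xxxx or 11110xxx, continuation bytes of the form 10xxxxxx). The function name decode_utf8 is hypothetical, and the sketch is not a full validator: it does not reject overlong encodings or surrogate values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes one code point from `s` starting at `pos`
// and advances `pos` past the bytes it consumed.
// Precondition: pos < s.size().
// Returns U+FFFD (the replacement character) on malformed or truncated input.
char32_t decode_utf8(const std::string& s, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    int extra = 0;      // number of continuation bytes expected
    char32_t cp = 0;    // value accumulated so far

    if (lead < 0x80) {
        return lead;                                  // 0xxxxxxx: ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        extra = 1; cp = lead & 0x1F;                  // 110xxxxx
    } else if ((lead & 0xF0) == 0xE0) {
        extra = 2; cp = lead & 0x0F;                  // 1110xxxx
    } else if ((lead & 0xF8) == 0xF0) {
        extra = 3; cp = lead & 0x07;                  // 11110xxx
    } else {
        return 0xFFFD;                                // stray continuation byte
    }

    for (int i = 0; i < extra; ++i) {
        if (pos >= s.size()) return 0xFFFD;           // sequence cut short
        const unsigned char c = static_cast<unsigned char>(s[pos]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;        // not 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);                  // append 6 payload bits
        ++pos;
    }
    return cp;
}
```

For example, decoding the two bytes 0xC4 0x8D yields 269, and the three bytes 0xE2 0x82 0xAC yield 0x20AC (the euro sign).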
You can read and write UTF-8 values through char streams and storage, but only as opaque byte streams. The moment you need to understand the values, you generally need to resort to wchar_t, uint16_t or uint32_t etc.
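The gap between "opaque bytes" and "characters" shows up as soon as you count them. The sketch below assumes a std::string holding well-formed UTF-8 and uses a hypothetical helper, count_code_points, which counts every byte that is not a continuation byte (10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 byte string by
// skipping continuation bytes. Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {  // lead byte or ASCII starts a new code point
            ++n;
        }
    }
    return n;
}
```

For a string containing the bytes of "na", then 0xC3 0xAF, then "ve", size() reports 6 bytes while count_code_points reports 5 code points.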
If you are working with Microsoft's toolset (noting the "D:/" path), you may need to look into TCHAR (https://msdn.microsoft.com/en-us/library/c426s321.aspx).