
I have a text file to which I am adding tags so that it can be read as XML. For our reader to recognize it as valid, each line must at least be wrapped in tags. My issue arises because this is actually a Syriac translation dictionary, so it contains many non-ASCII characters (the actual Syriac words). The most straightforward way I see to accomplish this is to prepend and append the needed tags to each line, in place, without necessarily accessing or modifying the rest of the line. Any other options would also be greatly appreciated.

ifstream in_file;
string file_name;

string line;
string line2;
string pre_text;
string post_text;

int num = 1;

pre_text = "<entry n=\"";
post_text = "</entry>";

file_name = "D:/TEI/dictionary1.txt";
in_file.open(file_name.c_str());

if (in_file.is_open()){
    while (getline(in_file, line)){
        line2 = pre_text + to_string(num) + "\">" + line + post_text;
        cout << line2;
        num++;
    }
}

The file in question may be downloaded here.

jww
Spencer
  • Consider running the file I/O in Unicode. It will save you a lot of grief. http://stackoverflow.com/questions/5026555/c-how-to-write-read-ofstream-in-unicode-utf8 And particularly watch out for point 3. – user4581301 Jun 26 '15 at 22:38
  • Rather than appending to a string with the + operator, consider using a [stringstream](http://en.cppreference.com/w/cpp/io/basic_stringstream). It should be swifter and will make the eventual switch to writing to an output file a cakewalk. – user4581301 Jun 26 '15 at 22:41
  • Remember to never output over an existing file; always output to a new file and then rename. – o11c Jun 26 '15 at 22:45
  • My problem is that when the compiler reads in the line it jumbles all the Syriac (Aramaic font) text. – Spencer Jun 26 '15 at 22:59
  • Possible duplicate of [How to overwrite only part of a file in C++](https://stackoverflow.com/q/2530274/608639) and [Is modifying a file without writing a new file possible in C++?](https://stackoverflow.com/q/2530443/608639) – jww Feb 18 '18 at 10:19

1 Answer


You are using std::string, which stores plain char values and knows nothing about the encoding of the text it holds, and you are opening your file in text mode ("text translation mode"). The first thing you need to do is open the file in binary mode so that it doesn't perform translation on individual char values:

in_file.open(file_name.c_str(), std::ios::binary);

or in C++11

in_file.open(file_name, std::ios::binary);

The next thing is to stop using std::string for storing the text from the file. You will need to use a string type that recognizes the character encoding you are using, along with the appropriate character type.

As it turns out, std::string is actually an alias for std::basic_string<char>. C++03 already had wchar_t, which supports "wide" characters (more than 8 bits), and C++11 introduced several new Unicode character types (char16_t and char32_t). There is a standard alias for a basic_string of wchar_t: std::wstring.

Start with the following simple test:

#include <iostream>
#include <fstream>
#include <string>

int main() {
    std::string file_name = "D:/TEI/dictionary1.txt";
    std::wifstream in_file(file_name, std::ios::binary);

    if (!in_file.is_open()) {
        // "L" prefix indicates a wide string literal
        std::wcerr << L"file open failed\n";
        return 1;
    }

    std::wstring line1;
    std::getline(in_file, line1);
    std::wcout << L"line1 = " << line1 << L"\n";
}

Note how cout etc also become prefixed with w...

The standard ASCII character set contains 128 characters numbered 0 through 127. In ASCII, \n and \r are represented with the 7-bit values 10 and 13 respectively.

Your text file appears to be UTF-8 encoded. UTF-8 uses an 8-bit unsigned representation that allows characters to use a variable number of bytes: the value 0 requires 1 byte, the value 128 requires 2 bytes, the value 8192 requires 3 bytes, and so on.
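The byte counts follow from fixed code point ranges. As a rough sketch (utf8_length is a hypothetical helper name, not something from the standard library, and it ignores the surrogate range), the rule can be written as:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes UTF-8 needs to encode one Unicode code point.
// Hypothetical helper for illustration only.
std::size_t utf8_length(std::uint32_t code_point) {
    assert(code_point <= 0x10FFFF);      // outside the Unicode range
    if (code_point < 0x80)    return 1;  // up to 7 bits: plain ASCII
    if (code_point < 0x800)   return 2;  // up to 11 bits (covers Syriac, U+0700..U+074F)
    if (code_point < 0x10000) return 3;  // up to 16 bits
    return 4;                            // up to 21 bits
}
```

With this, utf8_length(0) is 1, utf8_length(128) is 2 and utf8_length(8192) is 3, and every letter in the Syriac block takes two bytes.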

A byte with the highest bit (2^7) clear is a single, 7-bit ASCII value. A byte whose top bits are 110, 1110 or 11110 starts a 2-, 3- or 4-byte sequence, and every following "continuation" byte has the form 10xxxxxx, carrying 6 bits of the value. So the byte sequence { 0xC4, 0x80 } would represent the value (4 << 6) | 0 or (wchar_t)256. The byte sequence { 0xC4, 0x8D } represents (4 << 6) | 13 or wchar_t 269.
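To make the bit layout concrete, here is a minimal sketch that decodes a single two-byte sequence under the standard UTF-8 rules (lead byte 110xxxxx, continuation byte 10xxxxxx). The function name decode_utf8_2byte is made up for this example, it covers only the two-byte form, and it does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstdint>

// Decode one two-byte UTF-8 sequence into its code point.
// Hypothetical helper; a real decoder must also handle 1-, 3- and 4-byte forms.
std::uint32_t decode_utf8_2byte(unsigned char lead, unsigned char cont) {
    assert((lead & 0xE0) == 0xC0);  // lead byte must look like 110xxxxx
    assert((cont & 0xC0) == 0x80);  // continuation byte must look like 10xxxxxx
    // 5 payload bits from the lead byte, then 6 from the continuation byte.
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3Fu);
}
```

For instance, the Syriac letter alaph (U+0710) is stored as the two bytes 0xDC 0x90, and decode_utf8_2byte(0xDC, 0x90) gives back 0x0710.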

You can read and write utf-8 values through char streams and storage, but only as opaque byte streams. The moment you start to need to understand the values you generally need to resort to wchar_t, uint16_t or uint32_t etc.

If you are working with Microsoft's toolset (noting the "D:/" path), you may need to look into TCHAR (https://msdn.microsoft.com/en-us/library/c426s321.aspx)

kfsone
  • The line `std::getline(in_file, line1);` throws an error "No instance of overloaded function 'std::getline' matches the argument list. argument types are: (std::ifstream, std::wstring)" – Spencer Jul 17 '15 at 00:31