I was playing with std::wstring and std::wfstream, when I encountered a strange behaviour. Namely, it appears that std::basic_string<wchar_t>::find fails to find certain characters. Consider the following code:

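#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::wifstream input("input.txt");
    std::wofstream output("output.txt");

    if(!(input && output)){
        std::cerr << "file(s) not opened";
        return -1;
    }

    // Read the first line of the input file and copy it to the output file.
    std::wstring buf;
    std::getline(input, buf);

    output << buf;

    // Unexpectedly prints npos instead of an index.
    std::cout << buf.find(L'ć');
}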

Here I am simply reading the first line of the input file and writing it to the output file. Before the program runs, the content of the input file is aąbcćd and the output file is empty. After executing the code, the first line of the input file is successfully copied into the output file.

What's surprising to me is that I tried to find a ć letter in the buf and encountered the mentioned strange behaviour. After the program executed, I confirmed that the output file contains exactly aąbcćd, which obviously contains the mentioned character ć.

However, the line std::cout << buf.find(L'ć') did not behave as expected. I wasn't expecting to get an output of 4, given the memory layout of std::wstring, but I also definitely did not expect to get std::string::npos. It's worth mentioning that finding regular ASCII characters with this method succeeds.

To sum up, the code above correctly copies the first line of the input file to the output file, but it fails to find a character (returning npos) in the very string that holds the data being copied. Why is that so? What causes find to fail here?

Note: both of the files are UTF-8 encoded on Windows.

Fureeish
  • Have you used a debugger to determine what data is stored within `buf`? Or anything else to determine what is within `buf`? Have you examined the binary contents of the text file to determine what that `ć` is encoded as? Have you examined the integer value of `ć` within your C++ program? – Yakk - Adam Nevraumont Jun 27 '18 at 14:00
  • `sizeof(wchar_t) == 2` on Windows. You can imagine that reading such a string from a file encoded in UTF-8 will not give you what you want. – DeiDei Jun 27 '18 at 14:02
  • @DeiDei I can imagine that it might corrupt the string, but I cannot imagine how come only the non-ASCII characters become corrupted – Fureeish Jun 27 '18 at 14:20

1 Answer

Unfortunately, wchar_t isn't UTF-8; it's UTF-16 (on Windows), and no magic conversion happens when you read a UTF-8 file. If you debug your program, you'll see corrupted characters in your buf variable.
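To see why only the non-ASCII characters break, here is a minimal sketch that needs no files. It assumes the wide stream widens every byte of the file into its own wchar_t, which is consistent with the output file round-tripping correctly but is not guaranteed on every platform or locale. The names widen_bytes and utf8_line are made up for this illustration.

```cpp
#include <string>

// Assumption: mimics a std::wifstream that performs no UTF-8 decoding and
// simply turns each byte of the file into one wchar_t.
std::wstring widen_bytes(const std::string& bytes)
{
    std::wstring out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<wchar_t>(b));
    return out;
}

// "aąbcćd" spelled out as UTF-8 bytes (ą = C4 85, ć = C4 87), so the example
// does not depend on the encoding of this source file. The literal is split
// so that the hex escapes do not swallow the following letters.
const std::string utf8_line = "a\xC4\x85" "bc\xC4\x87" "d";
```

Under that assumption, widen_bytes(utf8_line) holds 8 elements rather than 6: ć became the two separate values 0xC4 and 0x87, so searching for the single value U+0107 finds nothing, while ASCII bytes keep their values and are found as usual. Narrowing each element back to one byte on output reproduces the original file, which would explain why the copy looks correct.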

You either need to read your string as a std::string and then convert it from UTF-8 to wchar_t, or work in UTF-8 throughout and replace the wchar_t literal with a std::string of UTF-8 characters.
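The first option can be sketched with a small hand-written decoder. This is only an illustration: utf8_to_wide is a hypothetical helper, it assumes well-formed UTF-8 and does no validation, and a real program would more likely use a library or an OS function for this.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into a std::wstring.
// Minimal sketch, no validation of malformed input.
std::wstring utf8_to_wide(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        if (i + len > in.size()) break;  // truncated sequence: stop
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t (e.g. Windows): store a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

With the line read into a std::string via std::ifstream, utf8_to_wide(buf).find(L'\u0107') then gives 4 for the input aąbcćd, because each character now occupies exactly one wchar_t.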

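If you are using a recent compiler (C++11 or later), you can use the following to create a UTF-8 string literal. Note that from C++20 onwards such a literal has type const char8_t[] and no longer converts to std::string directly: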

u8"ć"

The following should work:

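#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Narrow streams pass the UTF-8 bytes through unchanged.
    std::ifstream input("input.txt");
    std::ofstream output("output.txt");

    if(!(input && output)){
        std::cerr << "file(s) not opened";
        return -1;
    }

    std::string buf;
    std::getline(input, buf);

    output << buf;

    // Searches for the UTF-8 byte sequence of ć (compiles as C++11 to C++17).
    std::cout << buf.find(u8"ć");
}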
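One thing to keep in mind with the UTF-8 approach: find returns a byte offset, not a character index. For aąbcćd the result is 5 rather than 4, because ą takes two bytes. If the character index is what you need, a rough sketch like the following can convert it; code_point_index is a hypothetical helper and assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points that start before byte_offset
// in a well-formed UTF-8 string by skipping continuation bytes (10xxxxxx).
std::size_t code_point_index(const std::string& utf8, std::size_t byte_offset)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset && i < utf8.size(); ++i)
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            ++count;
    return count;
}
```

Calling code_point_index(buf, buf.find(u8"ć")) on the example line turns the byte offset 5 into the character index 4. Check the result of find against std::string::npos before passing it on.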
Alan Birtles
  • I can see why this might be an issue. Also, the corrupted characters are only those which are non-ASCII. How come these characters become 'corrupted' when read from the file, yet yield different values when compared? I would imagine the corruption to be applied in the same manner. Could you please provide an example of how what I tried to do could be achieved? – Fureeish Jun 27 '18 at 14:19
  • The easiest way is to use the utf8cpp library: http://utfcpp.sourceforge.net/. ASCII characters probably work due to the standard library doing some sort of automatic widening – Alan Birtles Jun 27 '18 at 14:24
  • Thank you. This clarified a few things. – Fureeish Jun 27 '18 at 14:31
  • FYI only, if you do need to convert from UTF-8 to UTF-16 on Windows, the `MultiByteToWideChar` function does that. – Peter Torr - MSFT Jul 29 '18 at 20:32