I'm working with UTF-8 text files on Windows and want to read them using C++.

To open the file correctly, I use fopen. As described here, there are two options for opening the file:

  • Text mode "rt" (carriage return + linefeed is automatically converted into linefeed; in short, "\r\n" becomes "\n").
  • Binary mode "rb" (the file is read byte by byte).

Now it becomes tricky. I don't want to open the file in binary mode, since I would lose the correct handling of my UTF-8 characters (my text files contain special characters, which are corrupted when interpreted as ANSI characters). But I also don't want fopen to convert all my CR+LF into LF.

Is there a way to combine the two modes, so that I can read a text file into a string without tampering with the line endings while still reading UTF-8 correctly?

I am aware that the reverse conversion would happen if I wrote the string back to a file opened the same way, but the string is sent to another application that expects Windows-style line endings.

Alexander Pacha
  • Why do you think binary mode will affect UTF-8? – Alan Stokes Dec 17 '14 at 16:53
  • The line endings in the *file* don't change; they get translated when you *read* the text. And the reverse translation happens when you write to the file. – Some programmer dude Dec 17 '14 at 16:53
  • @AlanStokes You are absolutely right. I just didn't think about the fact that the problem is the interpretation (the conversion of bytes into a string object) of the binary stream rather than fopen itself. – Alexander Pacha Dec 17 '14 at 17:14

1 Answer

The difference between opening a file in text mode and in binary mode is exactly that text mode translates line-end sequences and binary mode leaves them untouched. Nothing more, nothing less. ASCII characters have the same code points in Unicode, and UTF-8 retains the ASCII encoding (i.e., every ASCII file happens to be a UTF-8 encoded Unicode file), so the choice between binary and text mode won't affect the other bytes.
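
To illustrate the point, here is a minimal sketch of reading a whole file in binary mode into a std::string. The helper name read_file_binary and its bool-plus-out-parameter interface are made up for this example and are not part of any library. Because nothing is translated, both the CR+LF pairs and the multi-byte UTF-8 sequences arrive in the string exactly as they are stored on disk.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not a library function): reads the whole file as raw
// bytes. Binary mode performs no newline translation, so "\r\n" stays "\r\n"
// and UTF-8 byte sequences are copied through unchanged.
bool read_file_binary(const char* path, std::string& out) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) {
        return false;  // file could not be opened
    }

    out.clear();
    char buf[4096];
    std::size_t n;
    // fread returns the number of bytes actually read; 0 means EOF or error.
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        out.append(buf, n);
    }

    const bool ok = (std::ferror(f) == 0);
    std::fclose(f);
    return ok;
}
```

The resulting std::string is just a byte container holding UTF-8. Whether the characters display correctly depends on how those bytes are interpreted later, for example by converting them to UTF-16 before passing them to a wide-character Windows API. Note that a UTF-8 byte order mark at the start of the file, if present, is also kept as-is.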

It may be worth having a look at James McNellis's "Unicode in C++" presentation at C++Now 2014.

Dietmar Kühl