I am experimenting with file output in UTF-16 using wofstream, successfully so far. But I have a problem writing a new line. As I found out with Notepad and a hex editor, a new line on Windows corresponds to 2 symbols: LineFeed and CarriageReturn (0x000A and 0x000D). Trying to reproduce this programmatically led to a weird result.

#include <fstream>
#include <codecvt>
#include <locale>
#define ENDL L"\u000a\u000d"
using namespace std;
int main()
{
    locale utf16(locale(), new codecvt_utf16<wchar_t, 0x10ffffUL, little_endian>()); // for writing UTF-16
    wofstream fout(L"text.txt");
    fout.imbue(utf16);
    const unsigned short BOM = 0xFEFF;
    fout.write((wchar_t*)&BOM, 1);
    fout << L"some text" << ENDL << L"more text";
    fout.close();
}

The text that follows ENDL is totally messed up. I found the cause with a hex editor: for ENDL it writes 0D 0A 00 0D 00. That is, for some reason it writes an unnecessary and outright harmful 0D byte before the LineFeed character, which shifts all following bytes to the right and thus messes up the UTF-16 encoding.

I don't understand why this happens or how I can fix it.

leppie
Andrey Pro
  • That would seem to be a bug... – Mats Petersson May 27 '14 at 07:50
  • VS 2013 throws [this](http://msdn.microsoft.com/en-us/library/09t1e0z0.aspx) for me, which is correct according to 2.3/2 (If the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.). Still, this isn't where your problem is. – user657267 May 27 '14 at 08:10
  • I actually use the Intel compiler over MVS, which swallows it. But replacing the ENDL definition with L"\r\n" still gives the same result. – Andrey Pro May 27 '14 at 08:39
  • @AndreyPro looks like you should file a [bug report](https://connect.microsoft.com/visualstudio). – user657267 May 27 '14 at 08:41
  • I don't have a Microsoft account and feel uneasy about signing up for services I don't want to use. – Andrey Pro May 27 '14 at 11:06

1 Answer

Try opening your file in binary mode:

std::wofstream fout(L"text", std::ios_base::binary);

I don't have experience with Windows systems, but it seems the OS is unhelpfully replacing newlines with end-of-line sequences.
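
To see how that replacement could produce the byte sequence from the question, here is a minimal sketch. It assumes (I have not verified this on Windows) that the text-mode translation is applied to the bytes that come out of the UTF-16 conversion, not to the wide characters going in. The helper names `to_utf16le` and `text_mode_translate` are made up for this illustration and are not part of any library.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encode BMP-only text as UTF-16LE bytes.
// No surrogate handling; this is only a sketch.
std::vector<std::uint8_t> to_utf16le(const std::u16string& text) {
    std::vector<std::uint8_t> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF)); // low byte first
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));   // then high byte
    }
    return bytes;
}

// Hypothetical helper: mimic the assumed text-mode behavior, where every
// 0x0A byte gets a 0x0D byte inserted in front of it.
std::vector<std::uint8_t> text_mode_translate(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    for (std::uint8_t b : in) {
        if (b == 0x0A) {
            out.push_back(0x0D);
        }
        out.push_back(b);
    }
    return out;
}
```

Feeding the question's ENDL (U+000A followed by U+000D) through `text_mode_translate(to_utf16le(u"\n\r"))` gives an odd number of bytes, which is why every UTF-16 code unit after the line break ends up misaligned.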

Also, I would first imbue() the modified locale and then open() the file: once a character is read, calling imbue() has either no effect or undefined behavior (don't recall which off-hand). I think there is nothing preventing the stream from reading the first buffer upon open(). I don't think that's your actual problem, though.
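
Putting both suggestions together, a version of the writer might look roughly like the sketch below. The function name `write_utf16le` is hypothetical, the path is a narrow string (the wide-string file name in the question is a Visual Studio extension), and the line break is spelled `L"\r\n"` by the caller because binary mode does no translation. Note that `std::codecvt_utf16` is deprecated since C++17, so treat this as an illustration and not a recommendation.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: write `text` to `path` as UTF-16LE preceded by a BOM.
void write_utf16le(const std::string& path, const std::wstring& text) {
    std::wofstream fout;
    // Imbue first, so the stream never touches a buffer under the old locale.
    fout.imbue(std::locale(
        fout.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffffUL, std::little_endian>()));
    // Binary mode: bytes produced by the facet are written unchanged.
    fout.open(path, std::ios_base::out | std::ios_base::binary);
    fout.put(static_cast<wchar_t>(0xFEFF)); // byte order mark
    fout << text;
}   // the destructor flushes and closes the file
```

A call such as `write_utf16le("text.txt", L"some text\r\nmore text");` should then write the carriage return and line feed as two separate 16-bit code units.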

Dietmar Kühl
  • Opening in binary mode helps. I wonder, though, whether it would cause problems with normal text. – Andrey Pro May 27 '14 at 08:17
  • The effect of text vs. binary mode is exactly the behavior of replacing newline characters by end of line sequences [on some systems]. – Dietmar Kühl May 27 '14 at 08:19
  • `binary` fixes it, although OP shouldn't be using universal character names for control characters. It looks like a bug in VS2013's `codecvt_utf16` implementation. – user657267 May 27 '14 at 08:20
  • OK, if using binary mode has no adverse side effects, I'll just stick with it. – Andrey Pro May 27 '14 at 08:41