I have to manually add a UTF-8 BOM to a simple text file, but I cannot write it with the method below. With my rather limited C++ knowledge I do not understand what I am doing wrong. I assume it is related to the fact that I write only 3 bytes while the system, for whatever reason, expects a multiple of 2. The code is compiled with the Unicode character set. Any hints pointing me in the right direction would be welcome.

```cpp
FILE* fStream;
errno_t e = _tfopen_s(&fStream, strExportFile, TEXT("wt,ccs=UTF-8"));   // UTF-8

if (e != 0)
{
    // Error handling
    return 0;
}

CStdioFile* fileo = new CStdioFile(fStream);
fileo->SeekToBegin();

// Write BOM
unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
fileo->Write(bom, 3);
fileo->Flush();  // BOOM: Assertion failed buffer_size % 2 == 0
```
Marwie
  • I don't quite understand your question. From the [_tfopen_s documentation](https://msdn.microsoft.com/en-us/library/z5hh6ee9.aspx): *"Files that are opened for writing in Unicode mode have a BOM written to them automatically."* You are opening the file for writing, and you are enabling Unicode mode, so there doesn't appear to be an immediate need to manually write out a BOM. – IInspectable Jan 23 '17 at 17:26
  • @IInspectable I agree that it is mentioned in the documentation. However, I have never seen the BOM being written automatically when using the lines of code above. – Marwie Jan 24 '17 at 09:17

1 Answer

According to Microsoft's documentation for _tfopen_s (emphasis added):

When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).

Because the stream was opened with `ccs=UTF-8`, you are expected to write UTF-16 characters to the file, which the runtime then translates to UTF-8. Instead of the 3-byte sequence 0xEF,0xBB,0xBF, you need to write the single 16-bit value 0xfeff.
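
To see why writing 0xfeff ends up as the familiar three bytes on disk, here is a minimal, portable sketch of the UTF-8 encoding step. `encode_utf8` is a hypothetical helper written for illustration; it is not the CRT's internal conversion routine, only the same encoding rule applied by hand.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the CRT or MFC): encode one Unicode
// code point as UTF-8, following the standard 1- to 4-byte layout.
std::vector<std::uint8_t> encode_utf8(std::uint32_t cp) {
    // Reject values outside Unicode and the UTF-16 surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx  (U+FEFF falls in this range)
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Calling `encode_utf8(0xFEFF)` yields the bytes 0xEF, 0xBB, 0xBF. In the question's code, the corresponding change would presumably be something like `wchar_t bom = 0xFEFF; fileo->Write(&bom, sizeof(bom));` (untested, and assuming the 16-bit `wchar_t` used on Windows).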

Mark Ransom
  • You are absolutely right: replacing the array with `unsigned char bom[] = { 0xff, 0xfe };` and writing 2 bytes fixed the problem. Note that I had to swap the order of 0xfe and 0xff when storing them in the array. Any idea why? Thank you for pointing me to the right paragraph in the documentation ;-) – Marwie Jan 23 '17 at 17:13
  • @Marwie you need to swap the bytes because [x86 processors are little endian](http://stackoverflow.com/questions/5185551/why-is-x86-little-endian). If you wrote a `uint16_t` or `wchar_t` instead, you wouldn't need to worry about it - the bytes would already be swapped in memory. – Mark Ransom Jan 23 '17 at 17:20
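
The byte-order point from the last comment can be checked with a small portable sketch. The helper names `bytes_in_memory` and `host_is_little_endian` are made up for this illustration; the code only inspects how the host stores a 16-bit value and does not touch any file API.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the in-memory representation of a 16-bit
// value into two bytes, in the order the host stores them.
std::array<std::uint8_t, 2> bytes_in_memory(std::uint16_t value) {
    std::array<std::uint8_t, 2> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(value));
    return bytes;
}

// Hypothetical helper: true when the least significant byte comes first,
// as it does on x86.
bool host_is_little_endian() {
    return bytes_in_memory(0x0001)[0] == 0x01;
}
```

On a little-endian host, `bytes_in_memory(0xFEFF)` gives 0xFF followed by 0xFE, which is exactly the order that had to be typed into the `unsigned char` array by hand. Writing the value through a `uint16_t` or `wchar_t` variable lets the compiler produce that layout, so the swap never appears in the source.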