
There are plenty of questions on SO about this, but most of them do not mention writing a wstring back to a file. For example, I found this for reading:

// open as a byte stream
std::wifstream fin("/testutf16.txt", std::ios::binary);
// apply BOM-sensitive UTF-16 facet
fin.imbue(std::locale(fin.getloc(),
    new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
// read  
std::wstring ws;
for(wchar_t c; fin.get(c); )
{
    std::cout << std::showbase << std::hex << c << '\n';
    ws.push_back(c);
}

I tried similar stuff for writing:

    std::wofstream wofs("/utf16dump.txt", std::ios::binary);
    wofs.imbue(std::locale(wofs.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    wofs << ws;

but it produces garbage (or at least Notepad++ and vim can't interpret it). As mentioned in the title, I'm on Windows, native C++, VS 2010.

Input file:

t€stUTF16✡
test

This is the result:

t€stUTF16✡
test

Converted to hex:

0000000: 7400 ac20 7300 7400 5500 5400 4600 3100  t.. s.t.U.T.F.1.
0000010: 3600 2127 0d00 0a00 7400 6500 7300 7400  6.!'....t.e.s.t.
0000020: 0a                                       
                     ...

vim normal output:

t^@¬ s^@t^@U^@T^@F^@1^@6^@!'^M^@ ^@t^@e^@s^@t^@

EDIT: I ended up using UTF8. Andrei Alexandrescu says it is the best encoding so no big loss. :)

ST3
NoSenseEtAl
  • Don't just tell us it's garbage, provide a hex dump of the first 80 bytes or so, along with what you expected the contents to be. – Ben Voigt Jun 08 '12 at 15:35
  • There is just the BOM (byte order mark) missing in your file. This marker is used by editors to determine that your file is UTF-16. – Totonga Jun 08 '12 at 15:50
  • @Totonga: is `feff 7400 ac20 7300` an OK beginning? I still get Chinese characters in N++, and ^@ in vim. – NoSenseEtAl Jun 08 '12 at 15:59
  • No, that's not OK. The data is little-endian, so the byte order mark should be little-endian too, that is, FF FE. – Ben Voigt Jun 08 '12 at 16:14
  • With vim, you can use `set encoding` to set the encoding to anything regardless of byte order markers. If it really is UTF16, doing `set encoding=utf16` should make it legible. – Gort the Robot Jun 08 '12 at 17:02

3 Answers


Your "similar" code isn't similar. You removed the std::ios::binary flag, despite the fact that the documentation says:

The byte stream should be written to a binary file; it can be corrupted if written to a text file.

NL->CRLF conversion in text mode isn't going to do pretty things to UTF-16 files, since it will insert a single byte 0x0D instead of the two-byte code unit for CR (0x00 0x0D in big-endian order, 0x0D 0x00 in little-endian order).
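
To make the corruption concrete, here is a minimal sketch that imitates the translation on a raw byte buffer. The helper name `simulate_text_mode` is made up for illustration; it only mimics what a Windows text-mode stream does on output and is not a real library call.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: imitates text-mode output translation by putting a
// single 0x0D byte in front of every 0x0A byte, with no notion of UTF-16.
std::string simulate_text_mode(const std::string& bytes) {
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\n') {
            out.push_back('\r');   // one extra byte, not a full 16-bit unit
        }
        out.push_back(bytes[i]);
    }
    return out;
}
```

Feeding it the UTF-16LE bytes for "t" followed by a newline (74 00 0A 00) gives 74 00 0D 0A 00. The length is now odd, so every 16-bit code unit after the inserted byte is shifted by one byte and decodes as something else.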

Ben Voigt

It is easy if you can use the C++11 standard, because it adds facilities such as the <codecvt> header (std::codecvt_utf8, std::codecvt_utf16) that solve this problem.

But if you want to use multi-platform code with older standards, you can use this method to write with streams:

  1. Read the article about UTF converter for streams
  2. Add stxutif.h to your project from sources above
  3. Open the file in ANSI mode and add the BOM to the start of a file, like this:

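    std::ofstream fs;
    fs.open(filepath, std::ios::out|std::ios::binary);
    
    // UTF-8 byte order mark
    const unsigned char smarker[3] = { 0xEF, 0xBB, 0xBF };
    
    // write() emits exactly three bytes; operator<< would treat the
    // array as a C string and read past its end looking for a terminator
    fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker));
    fs.close();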
    
  4. Then open the file as UTF and write your content there:

    std::wofstream fs;
    fs.open(filepath, std::ios::out|std::ios::app);
    
    std::locale utf8_locale(std::locale(), new utf8cvt<false>);
    fs.imbue(utf8_locale); 
    
    fs << .. // Write anything you want...
    
Dr1Ku
Yarkov Anton

For output, you want to use generate_header instead of consume_header.

See http://en.cppreference.com/w/cpp/locale/codecvt_mode
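
As a cross-check for what the output file should contain, here is a hand-rolled sketch that builds the expected bytes without going through the facet: a little-endian BOM (FF FE) followed by little-endian code units. It is not the codecvt approach itself, and the names `append_unit`, `to_utf16le_with_bom` and `write_utf16le_file` are hypothetical helpers introduced only for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: appends one 16-bit code unit, low byte first.
void append_unit(std::string& out, unsigned long unit) {
    out.push_back(static_cast<char>(unit & 0xFF));
    out.push_back(static_cast<char>((unit >> 8) & 0xFF));
}

// Hypothetical helper: encodes a wide string as UTF-16LE with a leading BOM.
// Assumes ws holds valid UTF-16 (16-bit wchar_t) or UTF-32 (32-bit wchar_t).
std::string to_utf16le_with_bom(const std::wstring& ws) {
    std::string out("\xFF\xFE", 2);   // little-endian byte order mark
    for (std::size_t i = 0; i < ws.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(ws[i]);
        if (cp > 0xFFFF) {
            // Only reachable where wchar_t is 32 bits: split the code point
            // into a surrogate pair.
            cp -= 0x10000;
            append_unit(out, 0xD800 + (cp >> 10));
            append_unit(out, 0xDC00 + (cp & 0x3FF));
        } else {
            append_unit(out, cp);
        }
    }
    return out;
}

// Writes the encoded bytes through a narrow stream opened in binary mode,
// so no newline translation touches them.
bool write_utf16le_file(const char* path, const std::wstring& ws) {
    std::ofstream ofs(path, std::ios::binary);
    const std::string bytes = to_utf16le_with_bom(ws);
    ofs.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return ofs.good();
}
```

For the first two characters of the sample input, `t` and `€`, this yields FF FE 74 00 AC 20. A hex dump of the facet's output can be compared against that layout to spot a BOM whose byte order does not match the data.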

Ben Voigt
  • got asian stuff: 琀갠猀琀唀吀䘀㄀㘀℧ഀ਀琀攀猀琀 – NoSenseEtAl Jun 08 '12 at 15:47
  • @NoSenseEtAl: Do not post "text". The hex dump of the file is the *only* thing that gives a clue to what is going wrong. – Ben Voigt Jun 08 '12 at 15:57
  • 0000000: feff 7400 ac20 7300 7400 5500 5400 4600 ..t.. s.t.U.T.F. 0000010: 3100 3600 2127 0d00 0a00 7400 6500 7300 1.6.!'....t.e.s. 0000020: 7400 0a – NoSenseEtAl Jun 08 '12 at 15:59
  • @NoSenseEtAl: I don't know how, but you managed to get the byte order of the data and the byte order of the BOM mismatched. After swapping the first two bytes, notepad opens it just fine. – Ben Voigt Jun 08 '12 at 16:08
  • here is my entire code if you want it: int main() { // open as a byte stream std::wifstream fin("/testutf16.txt", std::ios::binary); // apply BOM-sensitive UTF-16 facet fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16)); // read std::wstring ws; for(wchar_t c; fin.get(c); ) { std::cout << std::showbase << std::hex << c << '\n'; ws.push_back(c); } std::wofstream wofs("/utf16dump.txt",std::ios::binary); wofs.imbue(std::locale(wofs.getloc(), new std::codecvt_utf8)); wofs << ws; } – NoSenseEtAl Jun 08 '12 at 16:50