Consider the following code:

#include <string>
#include <fstream>
#include <iomanip>

int main() {
    std::string s = "\xe2\x82\xac\u20ac";
    std::ofstream out("test.txt");
    out << s.length() << ":" << s << std::endl;
    out << std::endl;
    out.close();
}

Under GCC 4.8 on Linux (Ubuntu 14.04), the file test.txt contains this:

6:€€

Under Visual C++ 2013 on Windows, it contains this:

4:€\x80

(By '\x80' I mean the single 8-bit character 0x80).

I've been completely unable to get either compiler to output a € character using std::wstring.

Two questions:

  • What exactly does the Microsoft compiler think it's doing with the narrow string literal? It's obviously encoding it somehow, but it's not clear how.
  • What is the right way to rewrite the above code using std::wstring and std::wofstream so that it outputs two characters?
hippietrail
Tom
  • L"\x20ac\x20ac" The encoding of 8-bit strings on Windows is the ambient 8-bit code page, which is 1252 in the United States. You are using utf8. (You are also interpreting the output file as utf8 instead of 1252.) – Raymond Chen Aug 01 '14 at 03:12
  • A fair point - the "it contains this" on Windows is according to Notepad++ with the encoding set to UTF-8. – Tom Aug 01 '14 at 04:45
  • Hmmm, systeminfo gives both system and input locales as "en-gb;English (United Kingdom)", though whether that's a UTF-8 locale or not it doesn't say. – Tom Aug 01 '14 at 04:48
  • There is no such thing as a UTF-8 locale. Code page 65001 (UTF-8) cannot be the active code page. – Cody Gray - on strike Aug 01 '14 at 07:14
  • So what is "en_GB.utf8"? – Tom Aug 01 '14 at 07:35

1 Answer

This is because you are using \u20ac, a Unicode escape (a universal character name), inside a narrow string literal, and each compiler converts it to the encoding it uses for narrow strings.

MSVC encodes "\xe2\x82\xac\u20ac" as 0xe2, 0x82, 0xac, 0x80, which is 4 narrow characters. It encodes \u20ac as 0x80 because Unicode character U+20AC (the euro sign) is at position 0x80 in code page 1252.

GCC converts the Unicode escape \u20ac to the 3-byte UTF-8 sequence 0xe2, 0x82, 0xac, so the resulting string ends up as 0xe2, 0x82, 0xac, 0xe2, 0x82, 0xac.
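
To make the GCC result concrete, here is a minimal sketch of the UTF-8 encoding step. The helper names encode_utf8 and to_hex are made up for illustration and are not part of any library, and the sketch does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);  // outside the Unicode range is a caller error
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render bytes as space-separated lowercase hex.
std::string to_hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

With this sketch, to_hex(encode_utf8(0x20AC)) gives "e2 82 ac", the same three bytes that were written by hand in the question.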

If you use std::wstring s = L"\xe2\x82\xac\u20ac", MSVC stores the bytes 0xe2, 0x00, 0x82, 0x00, 0xac, 0x00, 0xac, 0x20, which is 4 wide characters (0x00e2, 0x0082, 0x00ac, 0x20ac, each stored little-endian). Since this mixes hand-written UTF-8 bytes with a UTF-16 code unit, the resulting string doesn't make much sense. If you use std::wstring s = L"\u20ac\u20ac", you get 2 Unicode characters in a wide string, as you'd expect.
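
A quick way to see this for yourself is to dump the value of each wide character. This is only a sketch; wide_units_hex is a made-up helper name, not a standard function.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: list each wchar_t of a wide string as hex, space-separated.
std::string wide_units_hex(const std::wstring& ws) {
    std::ostringstream os;
    os << std::hex;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        if (i != 0) os << ' ';
        // Widen to an unsigned integer so the stream prints a number, not a character.
        os << static_cast<unsigned long>(ws[i]);
    }
    return os.str();
}
```

For the mixed literal, wide_units_hex(L"\xe2\x82\xac\u20ac") gives "e2 82 ac 20ac" (four units, only the last of which is the euro sign), while wide_units_hex(L"\u20ac\u20ac") gives "20ac 20ac".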

The next problem is that MSVC's ofstream and wofstream write narrow ANSI (code page) text by default. To get UTF-8 output, imbue the stream with a conversion facet from <codecvt> (VS 2010 or later):

#include <string>
#include <fstream>
#include <iomanip>
#include <codecvt>
#include <locale>

int main()
{
    std::wstring s = L"\u20ac\u20ac";

    std::wofstream out("test.txt");
    std::locale loc(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
    out.imbue(loc);

    out << s.length() << L":" << s << std::endl;
    out << std::endl;
    out.close();
}

and to write UTF-16 (or more specifically UTF-16LE):

#include <string>
#include <fstream>
#include <iomanip>
#include <codecvt>
#include <locale>

int main()
{
    std::wstring s = L"\u20ac\u20ac";

    std::wofstream out("test.txt", std::ios::binary );
    std::locale loc(std::locale::classic(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>);
    out.imbue(loc);

    out << s.length() << L":" << s << L"\r\n";
    out << L"\r\n";
    out.close();
}

Note: With UTF-16 you have to open the file in binary mode rather than text mode to avoid corruption. Binary mode does no newline translation, so std::endl would write only a line feed; write L"\r\n" explicitly to get the usual Windows end-of-line.
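
If <codecvt> is not available with your standard library, the UTF-16LE conversion can also be done by hand. The following is a rough sketch rather than a drop-in replacement: to_utf16le is a made-up name, it takes a std::u32string (one code point per element, C++11), and it does no validation of the input.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode code points as UTF-16LE bytes (no BOM, no validation).
std::string to_utf16le(const std::u32string& text) {
    std::string out;
    // Append one 16-bit code unit, low byte first (little-endian).
    auto put16 = [&out](std::uint16_t unit) {
        out += static_cast<char>(unit & 0xFF);
        out += static_cast<char>(unit >> 8);
    };
    for (char32_t cp : text) {
        if (cp < 0x10000) {
            put16(static_cast<std::uint16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            put16(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
            put16(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```

The returned bytes can then be written with a plain std::ofstream opened with std::ios::binary, for example out << to_utf16le(U"\u20ac\u20ac\r\n"), which sidesteps the wide stream and its locale entirely.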

Chuck Walbourn
  • Thanks for the answer. Am I right in thinking that GCC doesn't support std::codecvt_utf8? – Tom Aug 01 '14 at 07:35
  • Minor correction: "It encodes \u20ac as 0x80 because Unicode character U+20AC is in position 80 in code page 1252 ([see table](http://en.wikipedia.org/wiki/Windows-1252))." – Raymond Chen Aug 01 '14 at 14:59
  • @Raymond - Excellent. Thanks for the clarification! I'll fix it. – Chuck Walbourn Aug 01 '14 at 17:48