I recently discovered the <codecvt> header and wanted to use it to convert between UTF-8 and UTF-16.

I use the codecvt_utf8_utf16 facet with wstring_convert from C++11. The issue is that when I convert a UTF-16 string to UTF-8 and then back to UTF-16, the endianness changes.

For this code:

#include <codecvt>
#include <string>
#include <locale>
#include <iostream>

using namespace std;

int main()
{
  // Converter between UTF-8 bytes and UTF-16 code units
  wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> convert;

  u16string utf16 = u"\ub098\ub294\ud0dc\uc624";

  cout << hex << "UTF-16\n\n";
  for (char16_t c : utf16)
    cout << "[" << unsigned(c) << "] ";  // print the code unit as a number

  string utf8 = convert.to_bytes(utf16);

  cout << "\n\nUTF-16 to UTF-8\n\n";
  for (unsigned char c : utf8)
    cout << "[" << int(c) << "] ";
  cout << "\n\nConverting back to UTF-16\n\n";

  utf16 = convert.from_bytes(utf8);

  for (char16_t c : utf16)
    cout << "[" << unsigned(c) << "] ";
  cout << endl;
}

I get this output:

UTF-16

[b098] [b294] [d0dc] [c624]

UTF-16 to UTF-8

[eb] [82] [98] [eb] [8a] [94] [ed] [83] [9c] [ec] [98] [a4]

Converting back to UTF-16

[98b0] [94b2] [dcd0] [24c6]

When I set the third template argument (Mode) of codecvt_utf8_utf16 to std::little_endian, the bytes are reversed.

What did I miss?

Dante
  • Cannot reproduce: http://coliru.stacked-crooked.com/a/5599be701f3ebb32 – Cubbi Jul 08 '15 at 14:04
  • Thanks for your reply. That's weird; I'm using gcc 5. I will try to compile it from source tonight to see if I get the same behaviour. – Dante Jul 08 '15 at 15:39
  • Switching compiler to gcc also doesn't reproduce this on coliru: http://coliru.stacked-crooked.com/a/cbac3e56d8f55c30 – Cubbi Jul 08 '15 at 18:58
  • Well, it works on OS X and Windows, so I guess there is a problem with libstdc++. I will report it as a bug. – Dante Jul 13 '15 at 11:05

1 Answer

It was indeed a bug in libstdc++: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66855. It will be fixed in GCC 5.3.
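
To check whether a given toolchain shows this behaviour, a round-trip test is probably the simplest approach. The sketch below is only an illustration: the helper names utf16_round_trip_ok and check_sample_string are made up for this example and are not part of any library. It converts a UTF-16 string to UTF-8 and back with the same facet as in the question, then compares the result with the input.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (name chosen for this example): returns true when
// UTF-16 -> UTF-8 -> UTF-16 gives back exactly the same code units.
bool utf16_round_trip_ok(const std::u16string& input)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;

    const std::string utf8 = convert.to_bytes(input);      // UTF-16 -> UTF-8
    const std::u16string back = convert.from_bytes(utf8);  // UTF-8 -> UTF-16

    return back == input;
}

// Hypothetical self-check using the string from the question.
// The assertion fails if the code units come back byte-swapped.
void check_sample_string()
{
    assert(utf16_round_trip_ok(u"\ub098\ub294\ud0dc\uc624"));
}
```

If check_sample_string() trips the assertion, the standard library in use most likely has the byte-swapping problem described in the bug report. Note that wstring_convert and <codecvt> are deprecated since C++17, so newer compilers may emit deprecation warnings for this code.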

Dante