I need to convert UTF-16 text to UTF-8. The conversion itself is simple:

std::wstring in(...);
std::string out = boost::locale::conv::utf_to_utf<char, wchar_t>(in);

However, the UTF-16 is read from a file and may or may not contain a BOM. My code needs to be portable (at minimum Windows, OS X and Linux). I'm struggling to figure out how to create a wstring from the byte sequence.

EDIT: this is not a duplicate of the linked question, as in that question the OP needs to convert a wide string into an array of bytes - and I need to convert the other way around.

Aleks G
  • I'm not sure, will [this post](https://stackoverflow.com/questions/2573834/c-convert-string-or-char-to-wstring-or-wchar-t) help? – gongzhitaao Feb 12 '14 at 15:04
  • How do you convert from the `vector` to a `wstring` ? – SirDarius Feb 12 '14 at 15:05
  • @SirDarius Well, this is exactly my question: how to get the `wstring` from the `vector`? – Aleks G Feb 12 '14 at 15:08
  • why use `boost::locale` when you can use `std::locale` (C++11 introduced UTF-8 and UTF-16 character sets)? – qdii Feb 12 '14 at 15:11
  • @qdii I don't care what to use - I need to get a utf8 string from byte array containing utf16. And it needs to be in a portable way (i.e. windows, OSX and unix/linux) – Aleks G Feb 12 '14 at 15:13
  • @gongzhitaao Won't help, as it's windows-specific. I need this to work on linux and windows. – Aleks G Feb 12 '14 at 15:14
  • @qdii No, not a duplicate at all. In that question the OP already has a wide string. I have an array of bytes. – Aleks G Feb 12 '14 at 15:19
  • @AleksG oh ok, so your title is misleading: you are not struggling with converting UTF-16 to UTF-8, but with reading from a UTF-16-encoded file into a wide string. Am I right? – qdii Feb 12 '14 at 15:25
  • @qdii I suppose, yes. I updated the question. – Aleks G Feb 12 '14 at 15:33
  • @AleksG: your question is still misleading. Your code snippet suggests you are reading a UTF-16 encoded file and want to convert it to UTF-8, but what you said to @qdii says you want to read a UTF-16 encoded file and leave it in UTF-16 (that is what `std::wstring` uses on Windows. On some other platforms it uses UTF-32 instead - which is why `wchar_t` is not portable). – Remy Lebeau Feb 13 '14 at 01:30

1 Answer

You should not use wide types at all in your case.

Assuming you can get a char * from your vector<char>, you can stick to bytes by using the following code:

const char * utf16_buffer = my_vector_of_chars.data();
const char * buffer_end = utf16_buffer + my_vector_of_chars.size();  // one past the last byte
std::string utf8_str = boost::locale::conv::between(utf16_buffer, buffer_end, "UTF-8", "UTF-16");

between operates on 8-bit characters and allows you to avoid conversion to 16-bit characters altogether.

You must use the between overload that takes a pointer to the end of the buffer. The overload that takes only a start pointer treats the input as null-terminated and stops at the first '\0' byte, which occurs almost immediately in UTF-16 text because every ASCII-range character is encoded with a zero byte.
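Since the question mentions that the file may or may not start with a BOM, here is a minimal sketch of one way to handle that before calling between. It uses only the standard library. The names `Utf16Source` and `detect_utf16` are hypothetical helpers, not part of Boost, and the little-endian fallback for BOM-less input is an assumption you would adjust to your data. How a plain "UTF-16" name treats missing BOMs may depend on the conversion backend, so choosing an explicit byte order can be the safer option.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type (not part of Boost): what to tell the converter.
struct Utf16Source {
    std::string encoding;  // encoding name to pass as the source encoding
    std::size_t offset;    // number of leading BOM bytes to skip
};

// Inspects the first two bytes for a UTF-16 byte order mark.
// If no BOM is found, returns `fallback` with offset 0. The default
// fallback ("UTF-16LE") is an assumption, not something the data tells us.
inline Utf16Source detect_utf16(const std::vector<char>& bytes,
                                const std::string& fallback = "UTF-16LE") {
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) return {"UTF-16LE", 2};  // little-endian BOM
        if (b0 == 0xFE && b1 == 0xFF) return {"UTF-16BE", 2};  // big-endian BOM
    }
    return {fallback, 0};
}
```

With `Utf16Source src = detect_utf16(my_vector_of_chars);`, the call above would become `between(my_vector_of_chars.data() + src.offset, my_vector_of_chars.data() + my_vector_of_chars.size(), "UTF-8", src.encoding)`, assuming your backend recognises the "UTF-16LE" and "UTF-16BE" names (iconv and ICU normally do).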

SirDarius
  • Hm, interesting thought. I'll give it a try and post back. – Aleks G Feb 13 '14 at 08:47
  • This almost works. Because the string contains Latin characters as well (i.e. there are \0 bytes in the vector), I have to explicitly specify the end pointer: `boost::locale::conv::between(&my_vector_of_chars[0], &my_vector_of_chars[my_vector_of_chars.size()], "UTF-8", "UTF-16")` – Aleks G Feb 13 '14 at 10:01
  • Ouch, makes sense, of course. With zero bytes in the data and no end pointer in the form I used in my answer, it's bound to fail, since between is going to stop at the first null char. Going to fix this. – SirDarius Feb 13 '14 at 10:15