
I'd like to convert a std::string to a UTF-16 std::wstring. I wrote the following code:

#include <string>
#include <locale>
#include <codecvt> // std::codecvt_utf8_utf16

std::string str = "aé"; // a test string with a French character

using cvt_type = std::codecvt_utf8_utf16<wchar_t>;
std::wstring_convert<cvt_type> converter;

std::wstring wstr = converter.from_bytes(str); // exception: std::range_error

The example above works fine with strings containing Unicode escape sequences (for instance std::string str = "\u0061\u00e9") or with strings that contain no special characters, but my first example doesn't work.

I get the following exception: Microsoft C++ exception: std::range_error at memory location 0x00A9E7E4. The program stops at this line: converter.from_bytes(str);

When I add the line str[1] = 130; // é, everything works fine, so I guess that signed chars are the reason for the issue. I need to use a string of signed chars because I want to send the data over TCP sockets.

How can I perform the conversion so that I can send my data over sockets?

Thanks in advance.

winapiwrapper
  • Your `std::string` is not UTF-8 encoded when using `"aé"`. That is dependent on the charset you save the source file as. To force UTF-8, use `u8"aé"` instead. Also, note that when `str[0]=97, str[1]=130`, that is not valid UTF-8 encoding, either. In UTF-8, `é` requires 2 `char` values acting together. So, `"aé"` in UTF-8 is encoded as `0x61 0xC3 0xA9` (`97 195 169`), not as `0x61 0x82` (`97 130`) like you imply. – Remy Lebeau Feb 06 '18 at 01:41
  • And why are you converting UTF-8 to UTF-16 for sending over a socket? Most Internet protocols use UTF-8 instead, as it is typically more compact than UTF-16 for most languages. You can send a `std::string` as-is over a socket without converting it (ie, `send(sckt, str.c_str(), str.size(), 0)`) – Remy Lebeau Feb 06 '18 at 01:47
  • Thank you so much, Remy Lebeau!! Now it works like a charm! I forgot to mention it, but I'm writing a chat application and I convert from UTF-8 to UTF-16 on the client side (for the GUI) and on the server side (because I need some information about the client, like the nickname, also for the GUI). – winapiwrapper Feb 06 '18 at 01:54
  • But I have another question: how can I ensure that the string is correctly encoded? If a malicious person sends badly encoded data, the conversion will fail again. – winapiwrapper Feb 06 '18 at 01:56
  • I only convert to UTF-16 when receiving the data, otherwise I send UTF-8 strings – winapiwrapper Feb 06 '18 at 01:58
  • If a client sends malformed data, they are violating your protocol. All you can really do is either ignore the data completely, or maybe convert non-ASCII chars to `?` for display purposes. `std::wstring_convert` can't do that for you, but [`MultiByteToWideChar()`](https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072.aspx) can. – Remy Lebeau Feb 06 '18 at 02:01
  • I remember I had a problem with MultiByteToWideChar, which is why I replaced it with codecvt, but I think I will reuse it. I'm already checking the header, but how can I check whether the data is correctly encoded or malformed? When codecvt fails, my program stops working, and I'd like to avoid that, so how can I find out if there are some non-UTF-8 characters in my string? – winapiwrapper Feb 06 '18 at 02:17
  • If your program stops working when it is sent malformed data, then it is not using very safe/robust code to begin with. Never trust user input, always sanitize it before using it. The only way to check for malformed data is to try to decode the data and see if the decoding fails. `MultiByteToWideChar()` works just fine when used correctly, and it can even drop/replace illegal characters for you. But if you want to know WHERE the data is malformed, you would have to decode the data manually (and decoding UTF-8 is not very hard to implement manually). – Remy Lebeau Feb 06 '18 at 02:24
  • It stops working when codecvt reaches an illegal character. I think I will catch the exception so that my program doesn't crash. Or should I use MultiByteToWideChar instead? What's the most efficient solution: codecvt or MultiByteToWideChar? – winapiwrapper Feb 06 '18 at 02:32
  • "*It stops working when codecvt reaches an illegal character. I think I will catch the exception so that my program doesn't crash*" - obviously, you should have been catching a decoding exception in the first place, yes. Or, disable the exception (by passing a default error string to the `wstring_convert` constructor for `from_bytes()` to return on failure). Or use `MultiByteToWideChar()` (`wstring_convert` might use that internally anyway, depending on STL implementation). What is "most efficient" is subjective. Profile the different approaches and use whatever works best for your needs. – Remy Lebeau Feb 06 '18 at 02:38
  • I'm going to correct all those things. Thank you for the valuable advice, Remy!! – winapiwrapper Feb 06 '18 at 02:44
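
To illustrate the first comment above, here is a minimal sketch (assuming a pre-C++20 compiler, where a u8"…" literal is still an array of plain char) showing that the same converter succeeds once the input really is UTF-8:

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::string str = u8"a\u00e9"; // "aé" forced to UTF-8: 0x61 0xC3 0xA9

    using cvt_type = std::codecvt_utf8_utf16<wchar_t>;
    std::wstring_convert<cvt_type> converter;

    std::wstring wstr = converter.from_bytes(str); // no std::range_error this time
    assert(wstr.size() == 2); // L'a' followed by L'\u00e9'
}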
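
And a sketch of the two error-handling options mentioned in the last comment: constructing the wstring_convert with fallback results so that from_bytes() returns them instead of throwing, or keeping the default behaviour and catching std::range_error. The "<invalid>" placeholder strings are my own choice, not anything prescribed by the library:

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

int main()
{
    using cvt_type = std::codecvt_utf8_utf16<wchar_t>;

    std::string bad = "a\x82"; // not valid UTF-8

    // Option 1: pass fallback results to the constructor; the first string is
    // returned by to_bytes() on failure, the second by from_bytes() on failure.
    std::wstring_convert<cvt_type> lenient("<invalid>", L"<invalid>");
    std::wstring wstr = lenient.from_bytes(bad); // returns L"<invalid>", no throw

    // Option 2: keep the throwing behaviour and catch the exception.
    std::wstring_convert<cvt_type> strict;
    try {
        std::wstring w = strict.from_bytes(bad);
    } catch (const std::range_error& e) {
        std::cerr << "decoding failed: " << e.what() << '\n';
    }
}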
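
Finally, a sketch of the MultiByteToWideChar() route for validating incoming data; the utf8_to_utf16 helper name and the error handling are illustrative choices of mine, not part of the Windows API:

#include <windows.h>
#include <stdexcept>
#include <string>

// With MB_ERR_INVALID_CHARS the call fails on malformed UTF-8; without the
// flag, invalid sequences are silently replaced with U+FFFD instead.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);
    if (len == 0)
        throw std::runtime_error("received malformed UTF-8");

    // Second call performs the actual conversion into the allocated buffer.
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &utf16[0], len);
    return utf16;
}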

0 Answers