C++ Back and forth conversion between UTF8 and UTF16 using UTF8-CPP (Non codecvt code!)

Question

I'm trying to make GWork (a fork of GWEN GUI) compile with GCC, and I need to be able to convert between UTF-8 and UTF-16 strings in both directions. I've found the UTF8-CPP library and so far it looks perfect.

Looking at the UTF8-CPP examples I notice that it uses an std::vector< unsigned short > as storage for UTF-16 strings.

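#include <iterator> // std::back_inserter
#include <string>
#include <vector>
#include "utf8.h"

std::string nstr = "...";

// UTF-8 -> UTF-16: append each 16-bit code unit to the vector
std::vector< unsigned short > wstrvec;
utf8::utf8to16(nstr.begin(), nstr.end(), std::back_inserter(wstrvec));

// UTF-16 -> UTF-8: append each byte to the string
std::string utf8str;
utf8::utf16to8(wstrvec.begin(), wstrvec.end(), std::back_inserter(utf8str));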

So now in my Utf8To16() function I have to copy the vector wstrvec into an std::wstring using:

return std::wstring(wstrvec.begin(), wstrvec.end());

And in my Utf16To8() function I would have to copy the data from an std::wstring into an std::vector< unsigned short > and then use it for conversion.

This looks like a waste of memory and computing time (not that it matters much here), and I'm not even sure it's safe.
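The two copies described above can be sketched as follows. This is only an illustration: `ToWide` and `ToUnits` are hypothetical names, not part of GWork or UTF8-CPP, and the `assert` is an assumed safeguard for platforms where `wchar_t` is wider than 16 bits and could hold values that do not fit in an `unsigned short`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: copy UTF-16 code units from the intermediate
// vector into a std::wstring, one element per code unit.
std::wstring ToWide(const std::vector<unsigned short>& units)
{
    return std::wstring(units.begin(), units.end());
}

// Hypothetical helper: copy the elements of a std::wstring into a
// vector of 16-bit code units, ready to be passed to a converter.
std::vector<unsigned short> ToUnits(const std::wstring& wide)
{
    std::vector<unsigned short> units;
    units.reserve(wide.size());
    for (wchar_t c : wide) {
        // Guard against silently truncating values that need more than 16 bits.
        assert(static_cast<unsigned long>(c) <= 0xFFFFul);
        units.push_back(static_cast<unsigned short>(c));
    }
    return units;
}
```

Each helper allocates a second buffer and walks the string once, which is the extra cost in question. The copy is lossless only as long as every element of the `std::wstring` really is a 16-bit code unit.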

So my question is: Can I use std::wstring directly with UTF8-CPP conversion functions instead of std::vector< unsigned short > ?

I'm weak on character encodings because I've never used anything beyond ASCII. From what I've read so far, std::wstring stores its characters as wchar_t, and wchar_t has a different size on different platforms. That's why I'm not sure my current implementation is safe to use.
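A minimal sketch of how the platform difference can be made visible at compile time. The constant names are hypothetical, and the comments about typical widths reflect common toolchains rather than a guarantee from the standard.

```cpp
#include <climits>

// Width of wchar_t in bits on the platform this is compiled for.
// Commonly 16 on Windows toolchains (including MinGW) and 32 on Linux.
constexpr unsigned kWcharBits = static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);

// True where one wchar_t has exactly the width of a UTF-16 code unit.
constexpr bool kWcharIs16Bit = (kWcharBits == 16);

// char16_t (C++11) is at least 16 bits wide on every conforming platform,
// so it can always hold one UTF-16 code unit.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
              "char16_t must be able to hold a UTF-16 code unit");
```

Code that assumes a std::wstring holds UTF-16 could check `kWcharIs16Bit` (for example in a `static_assert`) so that a build on a 32-bit `wchar_t` platform fails loudly instead of misbehaving at run time.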

I'm trying to stay away from codecvt because it's not available in the GCC version I'm using, and the solution must be cross-platform.

I'm using MinGW GCC 4.8.2, and C++11 can be used (except codecvt).

Thank you for your time.

Note that on many systems in general, and with gcc in particular, `wchar_t` is 32-bit large. On such systems, `wstring` typically stores a UTF-32 encoded string, not a UTF-16 encoded one. So if you need UTF-16 specifically, `wstring` would be a poor choice. Which is probably why that library of yours chose a different representation. A more modern library might have chosen `std::u16string` instead (new in C++11) — Igor Tandetnik, Mar 16 '14 at 05:26
Is this still a problem? Try adding an issue on the Github site. — Nick, May 18 '16 at 18:46
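The UTF8-CPP conversion functions are templated on their iterator types, so the destination does not have to be a vector. The self-contained sketch below illustrates that pattern with the `std::u16string` suggested in the first comment. `AppendUtf16` is a hypothetical helper written for this example, not a UTF8-CPP function; it encodes a single code point and writes the result through whatever output iterator it is given.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not part of UTF8-CPP: writes one Unicode code point
// as one or two UTF-16 code units to any output iterator.
template <typename OutIt>
OutIt AppendUtf16(char32_t cp, OutIt out)
{
    // Only valid scalar values: at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        *out++ = static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: a high and a low surrogate.
        cp -= 0x10000;
        *out++ = static_cast<char16_t>(0xD800 + (cp >> 10));
        *out++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    return out;
}
```

The same template works with `std::back_inserter` on a `std::u16string` or on a `std::vector<unsigned short>`, because only the iterator type changes. By the same reasoning, passing `std::back_inserter` of a `std::u16string` to `utf8::utf8to16` should avoid the intermediate vector, assuming the library version in use places no further requirements on the output iterator.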
