
I'm currently writing a plugin that is just a wrapper around an existing library. The plugin's host passes me a UTF-16 string whose character type is defined as follows:

typedef unsigned short PA_Unichar;

The wrapped library accepts only UTF-8 text, as either a const char* or a std::string. I tried writing a conversion function like this:

std::string toUtf8(const PA_Unichar* data)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return std::string(convert.to_bytes(static_cast<const char16_t*>(data)));
}

But this doesn't compile. The compiler reports: "static_cast from 'const pointer' (aka 'const unsigned short*') to 'const char16_t *' is not allowed"

So what's the most elegant/correct way to do it?

Thank you in advance.

Robotex
  • What's the value of `std::is_same::value` on your platform? Also, which compiler? – moshbear Dec 15 '12 at 13:37
  • `std::is_same::value` has a value of 0 (false) and I'm compiling on Mac with Apple LLVM compiler 4.1 though I also cross compile it with Visual Studio 2012 – Robotex Dec 15 '12 at 15:12
  • According to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html , `char16_t` is `uint16_least_t`, not `uint16_t`. On your platform, it looks like `uint16_least_t` is *not* aliased to `unsigned short`, thus `sizeof(char16_t) != sizeof(unsigned short)`. `static_cast` will fail on pointer types when the underlying `sizeof`s don't match. – moshbear Dec 15 '12 at 17:43
  • `char16_t` is 16-bit by definition. If `unsigned short` is being used for UTF-16 then it has to be 16-bit as well. I would either change `PA_Unichar` to `uint16_t` or use `reinterpret_cast` instead of `static_cast`. – Remy Lebeau Dec 15 '12 at 19:16
  • Before I saw the answer I used the latter approach. I could have replaced the typedef, but since I don't maintain the API, I can't risk breaking the code on every update. I wish the people who wrote the interface had just used standard types. – Robotex Dec 15 '12 at 23:08
  • Now that there are standard `char16_t` and `char32_t` types (which are new, distinct types in C++11, but typedefs for existing types in C11) it's likely that libraries will start using them. The [ICU](http://site.icu-project.org/) library already supports building it as C++11, in which case it uses the standard `charNN_t` types. – Jonathan Wakely Dec 15 '12 at 23:37
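
The type relationship discussed in these comments can be checked at compile time. The following is a minimal sketch, not part of the original thread: `sameWidth` is a hypothetical helper name, and the typedef is copied from the question. In C++11 `char16_t` is a distinct type, so `std::is_same` reports false even on platforms where both types are 16 bits wide, which is why `static_cast` between the two pointer types is rejected.

```cpp
#include <type_traits>

typedef unsigned short PA_Unichar;  // as declared by the plugin host in the question

// char16_t is a distinct type in C++11, never an alias for unsigned short,
// so pointers to the two types are unrelated as far as static_cast is concerned.
static_assert(!std::is_same<char16_t, PA_Unichar>::value,
              "char16_t and unsigned short are different types");

// Hypothetical helper: reports whether the two types happen to have the same size
// on the current platform. Equal size does not make the pointer types compatible.
constexpr bool sameWidth()
{
    return sizeof(char16_t) == sizeof(PA_Unichar);
}
```

Copying the code units element by element, rather than casting the pointer, avoids depending on either property.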

1 Answer


You could convert the PA_Unichar string to a string of char16_t using the basic_string(Iterator, Iterator) constructor, then use the std::codecvt_utf8_utf16 facet as you attempted:

std::string conv(const PA_Unichar* str, size_t len)
{
  // Copies each code unit into a char16_t, so no pointer cast is needed.
  std::u16string s(str, str+len);
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
  return convert.to_bytes(s);
}

I think that's right. Unfortunately I can't test this, as my implementation doesn't support it yet. I have an implementation of wstring_convert which I plan to include in GCC 4.9, but I don't have an implementation of codecvt_utf8_utf16 to test it with.
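
A fuller sketch of the same approach, for the case where only a pointer is available (as in the question's `toUtf8` signature), might look like the following. It assumes the host's strings are terminated by a 0 code unit, which the question does not confirm; `unicharLength` is a hypothetical helper introduced here for illustration. Note that `std::wstring_convert` and the `<codecvt>` facets were deprecated in C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

typedef unsigned short PA_Unichar;  // as declared by the plugin host in the question

// Hypothetical helper: counts code units up to, but not including, the
// terminating 0. Assumes the input is 0-terminated.
std::size_t unicharLength(const PA_Unichar* str)
{
    std::size_t len = 0;
    while (str[len] != 0) {
        ++len;
    }
    return len;
}

std::string toUtf8(const PA_Unichar* str)
{
    // The iterator-range constructor converts each unsigned short to char16_t
    // element by element, so no pointer cast is involved.
    std::u16string s(str, str + unicharLength(str));

    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    // to_bytes throws std::range_error if the input is not valid UTF-16.
    return convert.to_bytes(s);
}
```

The extra copy into a `std::u16string` costs one allocation per call, which is usually acceptable for a wrapper that only forwards strings to another library.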

Jonathan Wakely
  • Thank you very much, it seems to work well and this also saved me from some awful type casts :) – Robotex Dec 15 '12 at 23:12
  • Great, I'm glad the compiler in my head got the type checking right! Out of interest, which compiler are you using that supports those classes? – Jonathan Wakely Dec 15 '12 at 23:33
  • I'm compiling with the LLVM 4.1 compiler on Mac-based systems (after setting the flag `-std=c++11`) and Visual Studio 2012 on Windows systems – Robotex Dec 16 '12 at 00:05
  • Thanks for the info - I guess I'd better finish my GCC implementation then if the competition has it! I've not seen any demand for the classes, I don't think most people even know they exist – Jonathan Wakely Dec 18 '12 at 20:46
  • Drat, codecvt_utf8_utf16 isn't in gcc 4.8. Hopefully, Jonathan Wakely hit the gcc 4.9 cutoff. Too late for me. – BSalita May 15 '14 at 15:50