
Starting with C++11, one can convert UTF-8 to UTF-16 stored in wchar_t (at least on Windows, where wchar_t is 16 bits wide) using std::codecvt_utf8_utf16:

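#include <codecvt>  // std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>

std::wstring utf8ToWide( const char* utf8 )
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes( utf8 );
}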

Unfortunately, std::codecvt_utf8_utf16 is deprecated as of C++17. However, std::filesystem::path has all the relevant conversions built in; for example, it has these members (signatures as of C++20):

std::string string() const;
std::wstring wstring() const;
std::u8string u8string() const;
std::u16string u16string() const;
std::u32string u32string() const;

So the above function can be rewritten as follows:

#include <filesystem>
#include <string>

// Requires C++20 for char8_t.
std::wstring utf8ToWide( const char* utf8 )
{
    return std::filesystem::path( reinterpret_cast<const char8_t*>( utf8 ) ).wstring();
}

And unlike std::codecvt_utf8_utf16 this will not use any deprecated piece of C++.

What kind of drawbacks can be expected from such a converter? For example, is a path limited to a certain length, or are certain Unicode symbols prohibited in it?

Fedor
  • 1
    The UTF16 string types are u16string and char16_t, not wstring and wchar_t. Same for UTF8. `char` can be in any encoding. If you check [the documentation of those methods](https://en.cppreference.com/w/cpp/filesystem/path/string) you'll see that they're all either undefined or system-dependent. I suspect the implementation uses whatever method is provided by the OS, just as people did before `codecvt_utf8_utf16` was deprecated but *not* removed. There's no good solution to this – Panagiotis Kanavos Jun 03 '21 at 17:08
  • One can convert in std::u16string using std::filesystem::path as well – Fedor Jun 03 '21 at 17:10
  • 1
    The documentation also makes a distinction between `char` and `char8_t`. `char8_t` is always treated (or should be treated) as UTF8. `char`'s encoding depends on the environment settings – Panagiotis Kanavos Jun 03 '21 at 17:11
  • 1
    @PanagiotisKanavos Technically, `char`'s encoding can be anything regardless of environment settings. But locale aware functions will behave with such assumption. – eerorika Jun 03 '21 at 17:13
  • 1
    Why not just use `MultiByteToWideChar`? – Aykhan Hagverdili Jun 03 '21 at 17:13
  • 2
    @PanagiotisKanavos `There's no good solution to this` While there isn't a good standard solution, I'd say that there is a good solution: Use a library. – eerorika Jun 03 '21 at 17:15
  • 1
    @AyxanHaqverdili because that's a Windows *C* function, not a standard C++ function. The whole point of adding the Unicode types was to finally have Unicode support in C++, not use the OS's functions – Panagiotis Kanavos Jun 03 '21 at 17:15
  • 1
    @eerorika or the OS functions. I wonder what the `path` implementation does. Or keep using `codecvt_utf8_utf16` – Panagiotis Kanavos Jun 03 '21 at 17:16
  • 1
    @PanagiotisKanavos our options are : Using a deprecated part of the library, making an additional copy with `std::filesystem::path`, using `MultiByteToWideChar`. OP seems to be on Windows anyway. – Aykhan Hagverdili Jun 03 '21 at 17:20
  • 3
    _UTF8 to UTF16 conversion using std::filesystem::path_ To me, this sounds somehow like abusing a function for something else. Actually, I still believe that UTF-16 <-> UTF-8 conversion can be achieved by a quite simple bit arithmetic (and this is part of the concept). I would prefer a "hand-knitted" (and carefully tested) function over the deprecated `codecvt_utf8_utf16()` which I even used before the latter became available. (At least, until somebody tells me a valuable reason why I shouldn't.) – Scheff's Cat Jun 03 '21 at 17:21
  • 1
    @Scheff'sCat if that were possible, it wouldn't take a decade to implement nor would `codecvt_utf8_utf16` get deprecated. UTF16 itself can use more than 2 bytes. The hand-crafted method would differ from the OS methods in edge cases which would result in unpleasant surprises for developers and end users. – Panagiotis Kanavos Jun 03 '21 at 17:25
  • 1
    @Scheff'sCat there *are* Unicode-standard libraries, the [ICU libraries](http://site.icu-project.org/home). They're so standard they're shipped with Windows 10 now and [.NET Core switched to them](https://learn.microsoft.com/en-us/dotnet/core/compatibility/globalization/5.0/icu-globalization-api). These aren't part of the *C++* standard though. – Panagiotis Kanavos Jun 03 '21 at 17:28
  • 1
    You can use `std::codecvt` from `C++20`. or `std::codecvt` (system dependent). – Galik Jun 03 '21 at 17:55
  • 2
    @PanagiotisKanavos: "*The hand-crafted method would differ from the OS methods in edge cases which would result in unpleasant surprises for developers and end users.*" ... no, it wouldn't. UTF-8 and UTF-16 are both pretty simple numeric transformations and they are well-defined for the entire 21-bit range of Unicode. This is not difficult, complex code to write. It would take a decent programmer no more than 4 hours to implement such things, and you might add on some time for testing specific cases. – Nicol Bolas Jun 03 '21 at 18:14

1 Answer

What kind of drawbacks can be expected from such converter?

Well, let's get the most obvious drawback out of the way. For a user who doesn't know what you're doing, it makes no sense. Doing UTF-8-to-16 conversion by using a path type is bonkers, and should be seen immediately as a code smell. It's the kind of awful hack you do when you are needlessly averse to just downloading a simple library that would do it correctly.

Also, it doesn't have to work. path is meant for storing... paths. Hence the name. Specifically, they're meant for storing paths in a way easily consumed by the filesystem in question. As such, the string stored in a path can have any limitations that the filesystem wants to put on it, outside of a small plethora of things the C++ standard requires it to do.

For example, if the filesystem is case-insensitive (or even just ASCII-case-insensitive), it is a legitimate implementation to have it just case-convert all strings to lowercase when they are stored in a path. Or to case-convert them when you extract them from a path. Or anything of the like.

path can convert all of your \s into /s. Or your :s into /'s. Or any other implementation-dependent tricks it wants to do.

If you're afraid of using a deprecated facility, just download a simple UTF-8/16 converting library. Or write one yourself; it isn't that difficult.
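
To give a sense of how little code "write one yourself" involves, here is a minimal sketch of a hand-written converter. It is not taken from any library; the name `utf8ToUtf16` and the choice to throw `std::invalid_argument` on malformed input are assumptions made for illustration (replacing bad sequences with U+FFFD would be an equally reasonable policy). It decodes each UTF-8 sequence to a code point, rejects overlong forms, surrogates, and values past U+10FFFF, then emits one UTF-16 code unit or a surrogate pair.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): decodes UTF-8 and
// re-encodes it as UTF-16, throwing std::invalid_argument on malformed input.
std::u16string utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    out.reserve(utf8.size());  // UTF-16 never needs more code units than UTF-8 has bytes

    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;     // code point being decoded
        std::size_t extra = 0;    // number of continuation bytes expected
        std::uint32_t minCp = 0;  // smallest code point valid for this sequence length

        if (lead < 0x80)                { cp = lead;        extra = 0; minCp = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; minCp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (utf8.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and contributes 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogate code points, and values past U+10FFFF.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On a platform where wchar_t is 16 bits wide, such as Windows, the result could be copied into a wide string with `std::wstring(result.begin(), result.end())`; on platforms with a 32-bit wchar_t that copy would not produce valid wide text.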

Nicol Bolas
  • *path can convert all of your \s into /s. Or your :s into /'s.* Actually I thought that only an explicit call to `std::filesystem::path::make_preferred` can do so. – Fedor Jun 05 '21 at 12:45