UTF8 to UTF16 conversion using std::filesystem::path

Question

Starting with C++11, one can convert UTF-8 to UTF-16 wchar_t (at least on Windows, where wchar_t is 16 bits wide) using std::codecvt_utf8_utf16:

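#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert (deprecated since C++17)
#include <string>

std::wstring utf8ToWide( const char* utf8 )
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes( utf8 );
}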

Unfortunately, std::codecvt_utf8_utf16 is deprecated as of C++17. But there is std::filesystem::path, which converts between encodings internally; for example, it has these members:

std::string string() const;
std::wstring wstring() const;
std::u8string u8string() const;
std::u16string u16string() const;
std::u32string u32string() const;

So the above function can be rewritten as follows:

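#include <filesystem>
#include <string>

std::wstring utf8ToWide( const char* utf8 )
{
    // char8_t requires C++20; the cast makes path treat the bytes as UTF-8
    return std::filesystem::path( (const char8_t*) utf8 ).wstring();
}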

And unlike std::codecvt_utf8_utf16, this does not use any deprecated part of C++.

What drawbacks can be expected from such a converter? For example, is there a maximum length a path can have, or are certain Unicode symbols prohibited in it?

The UTF16 string types are u16string and char16_t, not wstring and wchar_t. Same for UTF8. `char` can be in any encoding. If you check [the documentation of those methods](https://en.cppreference.com/w/cpp/filesystem/path/string) you'll see that they're all either undefined or system-dependent. I suspect the implementation uses whatever method is provided by the OS, just as people did before `codecvt_utf8_utf16` was deprecated but *not* removed. There's no good solution to this — Panagiotis Kanavos, Jun 03 '21 at 17:08
One can convert in std::u16string using std::filesystem::path as well — Fedor, Jun 03 '21 at 17:10
The documentation also makes a distinction between `char` and `char8_t`. `char8_t` is always treated (or should be treated) as UTF8. `char`'s encoding depends on the environment settings — Panagiotis Kanavos, Jun 03 '21 at 17:11
@PanagiotisKanavos Technically, `char`'s encoding can be anything regardless of environment settings. But locale aware functions will behave with such assumption. — eerorika, Jun 03 '21 at 17:13
@PanagiotisKanavos `There's no good solution to this` While there isn't a good standard solution, I'd say that there is a good solution: Use a library. — eerorika, Jun 03 '21 at 17:15
@AyxanHaqverdili because that's a Windows *C* function, not a standard C++ function. The whole point of adding the Unicode types was to finally have Unicode support in C++, not use the OS's functions — Panagiotis Kanavos, Jun 03 '21 at 17:15
@eerorika or the OS functions. I wonder what the `path` implementation does. Or keep using `codecvt_utf8_utf16` — Panagiotis Kanavos, Jun 03 '21 at 17:16
@PanagiotisKanavos our options are : Using a deprecated part of the library, making an additional copy with `std::filesystem::path`, using `MultiByteToWideChar`. OP seems to be on Windows anyway. — Aykhan Hagverdili, Jun 03 '21 at 17:20
_UTF8 to UTF16 conversion using std::filesystem::path_ To me, this sounds somehow like abusing a function for something else. Actually, I still believe that UTF-16 <-> UTF-8 conversion can be achieved by a quite simple bit arithmetic (and this is part of the concept). I would prefer a "hand-knitted" (and carefully tested) function over the deprecated `codecvt_utf8_utf16()` which I even used before the latter became available. (At least, until somebody tells me a valuable reason why I shouldn't.) — Scheff's Cat, Jun 03 '21 at 17:21
@Scheff'sCat if that was possible, it wouldn't take a decade to implement nor would `codecvt_utf8_utf16` get deprecated. UTF16 itself can use more than 2 bytes. The hand-crafted method would differ from the OS methods in edge cases which would result in unpleasant surprises for developers and end users. — Panagiotis Kanavos, Jun 03 '21 at 17:25
@Scheff'sCat there *are* Unicode-standard libraries, the [ICU libraries](http://site.icu-project.org/home). They're so standard they're shipped with Windows 10 now and [.NET Core switched to them](https://learn.microsoft.com/en-us/dotnet/core/compatibility/globalization/5.0/icu-globalization-api). These aren't part of the *C++* standard though. — Panagiotis Kanavos, Jun 03 '21 at 17:28
You can use `std::codecvt` from `C++20`. or `std::codecvt` (system dependent). — Galik, Jun 03 '21 at 17:55
@PanagiotisKanavos: "*The hand-crafted method would differ from the OS methods in edge cases which would result in unpleasant surprises for developers and end users.*" ... no, it wouldn't. UTF-8 and UTF-16 are both pretty simple numeric transformations and they are well-defined for the entire 21-bit range of Unicode. This is not difficult, complex code to write. It would take a decent programmer no more than 4 hours to implement such things, and you might add on some time for testing specific cases. — Nicol Bolas, Jun 03 '21 at 18:14

Nicol Bolas · Accepted Answer · 2021-06-03T18:17:43.473

> What kind of drawbacks can be expected from such converter?

Well, let's get the most obvious drawback out of the way. For a user who doesn't know what you're doing, it makes no sense. Doing UTF-8-to-16 conversion by using a path type is bonkers, and should be seen immediately as a code smell. It's the kind of awful hack you do when you are needlessly averse to just downloading a simple library that would do it correctly.

Also, it doesn't have to work. path is meant for storing... paths. Hence the name. Specifically, it is meant for storing paths in a form easily consumed by the filesystem in question. As such, the string stored in a path can be subject to any limitations the filesystem wants to put on it, apart from the handful of things the C++ standard requires of path.

For example, if the filesystem is case-insensitive (or even just ASCII-case-insensitive), it is a legitimate implementation to have it just case-convert all strings to lowercase when they are stored in a path. Or to case-convert them when you extract them from a path. Or anything of the like.

path can convert all of your \s into /s. Or your :s into /'s. Or any other implementation-dependent tricks it wants to do.

If you're afraid of using a deprecated facility, just download a simple UTF-8/16 converting library. Or write one yourself; it isn't that difficult.
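As a rough illustration of the "write one yourself" option, here is a minimal sketch of a UTF-8 to UTF-16 converter in portable C++11. The function name `utf8ToUtf16` and the choice to throw `std::invalid_argument` on malformed input are assumptions made for this example, not something from the question; a real implementation would want more thorough testing and possibly a different error policy (for instance, substituting U+FFFD).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (name and error policy are assumptions for this sketch):
// decodes UTF-8 and re-encodes it as UTF-16, throwing on malformed input.
std::u16string utf8ToUtf16( const std::string& utf8 )
{
    std::u16string out;
    out.reserve( utf8.size() );
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while ( i < n )
    {
        const unsigned char lead = static_cast<unsigned char>( utf8[i] );
        std::uint32_t cp = 0;     // decoded code point
        std::size_t extra = 0;    // number of continuation bytes expected
        std::uint32_t minCp = 0;  // smallest code point valid for this length

        if ( lead < 0x80 )                  { cp = lead;        extra = 0; minCp = 0; }
        else if ( ( lead & 0xE0 ) == 0xC0 ) { cp = lead & 0x1F; extra = 1; minCp = 0x80; }
        else if ( ( lead & 0xF0 ) == 0xE0 ) { cp = lead & 0x0F; extra = 2; minCp = 0x800; }
        else if ( ( lead & 0xF8 ) == 0xF0 ) { cp = lead & 0x07; extra = 3; minCp = 0x10000; }
        else throw std::invalid_argument( "invalid UTF-8 lead byte" );

        if ( n - i - 1 < extra )
            throw std::invalid_argument( "truncated UTF-8 sequence" );

        // Each continuation byte must look like 10xxxxxx and adds 6 bits.
        for ( std::size_t k = 1; k <= extra; ++k )
        {
            const unsigned char c = static_cast<unsigned char>( utf8[i + k] );
            if ( ( c & 0xC0 ) != 0x80 )
                throw std::invalid_argument( "invalid UTF-8 continuation byte" );
            cp = ( cp << 6 ) | ( c & 0x3F );
        }
        i += extra + 1;

        // Reject overlong encodings, values past U+10FFFF, and surrogates.
        if ( cp < minCp || cp > 0x10FFFF || ( cp >= 0xD800 && cp <= 0xDFFF ) )
            throw std::invalid_argument( "invalid code point in UTF-8 input" );

        if ( cp < 0x10000 )
        {
            out.push_back( static_cast<char16_t>( cp ) );
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back( static_cast<char16_t>( 0xD800 + ( cp >> 10 ) ) );
            out.push_back( static_cast<char16_t>( 0xDC00 + ( cp & 0x3FF ) ) );
        }
    }
    return out;
}
```

On a platform where wchar_t is 16 bits wide (such as Windows), the result can be copied into a std::wstring with `std::wstring( u16.begin(), u16.end() )`, where `u16` is the returned string.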

*path can convert all of your \s into /s. Or your :s into /'s.* Actually I thought that only an explicit call to `std::filesystem::path::make_preferred` can do so. — Fedor, Jun 05 '21 at 12:45
