Converting UTF-16 (Windows wchar_t) to UTF-8 in C++: non-English letters (Korean) corrupted

Question

I'm trying to make a multiplatform app. On the Windows Store App (WinRT) side, I open a file and read its path as a Platform::String, which holds wchar_t data (UTF-16 on Windows).

Since my core logic is platform independent and only uses standard C++ data types, I convert the path into a UTF-8 std::string with this code:

        // Requires <string>, <locale>, and <codecvt>.
        Platform::String^ copyPath = copy->Path;
        std::wstring source(copyPath->Data());  // UTF-16 code units on Windows
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
        std::string u8CopyPath = convert.to_bytes(source);  // UTF-8 bytes

However, when I check u8CopyPath in the debugger, it shows corrupted letters for the non-English characters. Far as I know, UTF-8 is perfectly capable of encoding non-English languages, since it can use multiple bytes for a single letter. Is there something in the conversion that corrupts the non-English letters?

What makes you think it is corrupted? Can you show us what you see in the debugger (preferably in hex)? – mvidelgauz, Jul 07 '16 at 11:05
`Far as I know, UTF-8 is perfectly capable of encoding non-English languages` Yes. But Visual Studio and its debugger output could be unable to handle UTF-8. After all, most of Microsoft's products have either no or just limited UTF-8 support. – deviantfan, Jul 07 '16 at 13:10
@deviantfan: The problem is not an inability to output or handle UTF-8. The issue is that a `char*` is ambiguous. It could refer to an ASCII string, an ANSI string, or UTF-8, and Visual Studio needs to decide how to interpret the data. Since there aren't any hints attached, Visual Studio **chooses** to interpret `char*` as MBCS (codepage-encoded) strings, as that's the most natural choice on Windows. Besides, you can save (and load) UTF-8 encoded source files with Visual Studio. – IInspectable, Jul 13 '16 at 22:17
@IInspectable `chooses` could be an excuse for everything. MS chose not to support, e.g., UTF-8 in the console, and using *the UTF-8 codepages* they created can lead to horrible data loss (but the report was marked as *wontfix*). Well, yeah. I still call this a problem. Either prevent it or support it. `you can save (and load) UTF-8 encoded source files with Visual Studio` I could give you test cases where it fails. – deviantfan, Jul 14 '16 at 11:04
@deviantfan: What would **you** do if you were to write a debugger and had to interpret data identified through a `char*`? Would you not have to choose a character encoding? Would you then, too, call that an excuse? And given the ambiguity, would you not choose the wrong encoding sometimes? This comment is defying logic. – IInspectable, Jul 14 '16 at 11:15
@IInspectable I'm not talking about the visual representation of debugger strings, I'm talking about serious problems inside. ... Well, whatever. – deviantfan, Jul 14 '16 at 11:18
@deviantfan: You were clearly responding to my comment on why Visual Studio's debugger displays strings the way it does. And now you turn around to claim that you weren't? In that case I guess I'll have to agree with you by saying: *"Well, whatever"*... – IInspectable, Jul 14 '16 at 11:29
@IInspectable Did you read all of my comment? E.g., the part about data loss? Again, my problem never was the visual representation (and yes, I did respond to your comment). – deviantfan, Jul 14 '16 at 11:54
@deviantfan: The Windows console does not claim to provide full UTF-8 support, so not supporting it is to be expected. And the rest really is just about visual representation, because that's the only time the specific encoding matters. But then again, the intersection of the set of people talking about *"serious problems"* and the set of people that understand those problems is empty. – IInspectable, Jul 14 '16 at 12:06
Sigh. Just continue insulting me and misrepresenting the facts ... To give you a bit more to bite on: Yes, the Windows console breaks situations where *nothing is printed to screen and/or received from keyboard input*. Encoding actually does matter in this case even without visuals. That's Microsoft quality software. The problem is reproducible and acknowledged by MS, but not planned to be fixed.... Bye. – deviantfan, Jul 14 '16 at 12:26
@deviantfan: I never insulted you. You insulted logic. That's something the two of you have to fight out. – IInspectable, Jul 14 '16 at 16:28

score 0 · Answer 1 · answered Jul 08 '16 at 10:26

It turns out it's just a debugger thing. Once I wrote the string to a file and examined it, it came out correctly.
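One way to check the bytes without relying on how the debugger renders them is to dump them in hex, as the first comment suggested. The following is only a minimal sketch: `utf16_to_utf8`, `to_hex`, and `self_check` are hypothetical helpers written for illustration, and the conversion is done by hand on a `std::u16string` rather than through the `std::wstring_convert` call from the question, so that it behaves the same on platforms where `wchar_t` is not 16 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: hand-written UTF-16 -> UTF-8 conversion.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: render bytes as space-separated uppercase hex.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string hex;
    for (unsigned char b : bytes) {
        if (!hex.empty()) hex += ' ';
        hex += digits[b >> 4];
        hex += digits[b & 0x0F];
    }
    return hex;
}

// Sanity check on a Korean sample: U+D55C U+AE00 ("한글").
void self_check() {
    assert(to_hex(utf16_to_utf8(u"\uD55C\uAE00")) == "ED 95 9C EA B8 80");
}
```

If the hex dump of the converted path shows three bytes per Hangul syllable in the `EA`..`ED` lead-byte range, the conversion itself is fine and only the display is misleading.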

legokangpalla


This is normal behavior. The debugger interprets `char*`s as MBCS strings. If the string uses a different encoding, you need to give the debugger a hint, using [C++ format specifiers](https://msdn.microsoft.com/en-us/library/75w45ekt.aspx): `mystring,s8` displays `mystring` as UTF-8. – IInspectable Jul 12 '16 at 11:34
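To illustrate why the same `char*` data can look corrupted, here is a small sketch that decodes identical bytes under two different assumptions. The function names are hypothetical, Latin-1 stands in for "some single-byte code page" (the code page Visual Studio actually uses depends on the system locale), and the UTF-8 decoder assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Interpretation 1: every byte is one character. Latin-1 is used as the
// stand-in code page because its byte values map directly to code points.
std::u32string decode_as_latin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out += static_cast<char32_t>(b);
    return out;
}

// Interpretation 2: the bytes are UTF-8.
// Minimal decoder: assumes well-formed input, performs no validation.
std::u32string decode_as_utf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Append six payload bits from each continuation byte.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out += cp;
        i += len;
    }
    return out;
}

// The UTF-8 bytes of "한글" are two characters under one reading, six under the other.
void demo_check() {
    const std::string bytes = "\xED\x95\x9C\xEA\xB8\x80";
    assert(decode_as_utf8(bytes).size() == 2);
    assert(decode_as_latin1(bytes).size() == 6);
}
```

Neither reading changes the stored bytes; the `,s8` hint only tells the debugger which of these interpretations to apply when it draws the string.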
