How to convert between UTF-8 and TCHAR generically

Question

I know I've seen this on the web, but when I searched for it I only found this example. However, that code is only valid if UNICODE is defined. The one I thought I saw had conditionals for when UNICODE was not defined, and I think it had a third case as well (could it depend on whether MBCS is defined?).

The questions are either:

Can I find the source I thought I saw somewhere?
Am I correct that these three cases are the only ones one has to handle, and how is the conversion done in the non-UNICODE case?

The purpose of the conversion is for use with the Windows API.

No, there are only two cases: the straightforward one (UTF-8 to UTF-16), and the impossible one (UTF-8 to MBCS). MBCS cannot represent all code points that UTF-8 can encode. P.S.: The Windows API doesn't use `TCHAR`s, except for compatibility with Win9x versions of Windows. If you are targeting any supported version of Windows (or any unsupported Windows NT based version), you simply pass `wchar_t*` in place of `TCHAR*` (with very few exceptions that take `char*`). — IInspectable, Nov 19 '15 at 15:28
P.P.S.: The correct way to deal with UTF-8 on Windows is to convert to UTF-16 when data enters the application (socket, file, pipe, etc.), and convert to UTF-8 when it leaves the application. Inside the application there should only ever be UTF-16 encoded string data. The few exceptions are internal Windows APIs that only deal with ANSI strings (like [GetProcAddress](https://msdn.microsoft.com/en-us/library/windows/desktop/ms683212.aspx)). — IInspectable, Nov 19 '15 at 15:42
An alternative view is keep everything as UTF-8 internally and convert to UTF-16 when accessing Windows API functions. See [UTF-8 Everywhere](http://www.utf8everywhere.org/) and [Boost.Nowide](http://cppcms.com/files/nowide/html/index.html). That's all C++ though; the problem is harder to solve in C. — Ian Abbott, Nov 19 '15 at 16:03
@IInspectable: That is bad advice and precludes both writing portable code and clean round-trip handling of data files which might contain non-UTF-8 junk (which is junk, but which should not be corrupted by your program). — R.. GitHub STOP HELPING ICE, Nov 19 '15 at 16:10
@IanAbbott: The problem is (or will soon be) easily solved in plain C using [midipix](http://midipix.org/). — R.. GitHub STOP HELPING ICE, Nov 19 '15 at 16:10
@R..: I wasn't proposing to convert **any** octet stream to UTF-16. Obviously, this was meant for UTF-8 encoded streams only. Also obvious, not every octet stream represents valid UTF-8, so impossibility precludes conversion to UTF-16 in the general case. — IInspectable, Nov 19 '15 at 16:23
@IInspectable: What I was trying to express is that lots of *nominally* UTF-8 text actually contains junk, and corrupting that further is very bad behavior for an application. — R.. GitHub STOP HELPING ICE, Nov 19 '15 at 18:01
@R..: Why? It's just following the "Junk in, junk out" rule. What damage is done? If anything, it's an additional check, that allows your application to fail early. — IInspectable, Nov 19 '15 at 18:17
@IInspectable I don't see the advantage of using UTF-16 inside the application, but I don't intend to start a holy war about that. I had the impression that Windows used a fixed-width 16-bit encoding, btw. Regarding only two cases, what does `MBCS` being defined or not defined mean then? Regarding the impossibility of the conversion, I realize that one would have to use some error handling (so it becomes possible again). — skyking, Nov 19 '15 at 19:20
@skyking: If neither `UNICODE` nor `MBCS` is defined, the default is SBCS (ASCII) (see [Generic-Text Mappings in Tchar.h](https://msdn.microsoft.com/en-us/library/c426s321.aspx)). Since ASCII is a subset of `MBCS` there's no need to handle it explicitly, and you have only two cases to worry about. Note that the preprocessor symbols without a leading underscore control the Windows API set to use. The respective symbols with a leading underscore (`_UNICODE` and `_MBCS`) control the generic text mappings for the CRT. The symbols should correspond in your project settings. — IInspectable, Nov 19 '15 at 19:44
@skyking: Btw. Windows supports supplementary characters as well (see [Unicode](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374081.aspx)). The fixed 16-bit encoding was in use, when work on Windows NT started. At that time, UTF-16 and UCS2 were identical. Things have changed since then. — IInspectable, Nov 19 '15 at 20:01
@IInspectable: Silently corrupting data is not acceptable even if you think the original data is bad. You'll destroy chances of the user recovering it. Lots of text editors have this flaw. GNU Emacs is one of the few that does it right. XEmacs (at least last I used it) silently corrupts even valid UTF-8 if it contains any unassigned codepoints. — R.. GitHub STOP HELPING ICE, Nov 20 '15 at 00:34
@R..: Who was talking about silently corrupting data!? I wasn't. I was talking about converting UTF-8 to UTF-16. Error checking goes without saying. And if it fails, do have the dignity to go up in flames. Without second thoughts. Don't judge my recommendations by the experience you had with poorly crafted tools, that desperately try to recover from the unrecoverable. — IInspectable, Nov 20 '15 at 00:54
@IInspectable: I'm sorry if I came across hostile. I just think converting text to an internal form different from its original form has lots of pitfalls and risks that someone may introduce (often silent) data loss/corruption, even if you wouldn't do that yourself. — R.. GitHub STOP HELPING ICE, Nov 20 '15 at 01:14
@R..: [MultiByteToWideChar](https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072.aspx) knows UTF-8 and can be instructed to dismiss invalid input. If the input claims to be UTF-8, but isn't valid UTF-8, then all bets are off, and the best you can do (and the only safe option) is to fail. If the input is UTF-8, then the roundtrip UTF-8 -> UTF-16 -> UTF-8 is lossless. Stating that some text editors don't know text is inconclusive, and doesn't advance the discussion at hand. Windows is natively UTF-16, and leaving things native helps keep the doctor away. — IInspectable, Nov 20 '15 at 01:53
@IInspectable: If you just do conversion for presentation, you can always keep the original data form even if it contains errors. And "Windows is natively UTF-16" is not a compelling argument at all unless your whole program (rather than just the UI layer) is highly coupled with Windows, which is really bad design. — R.. GitHub STOP HELPING ICE, Nov 20 '15 at 02:52
I've not seen any comment that could explain the downvote. Would anyone like to provide feedback as to why this was downvoted? — skyking, Nov 20 '15 at 06:18
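
For readers following the thread, here is a minimal portable C sketch of the "fail on invalid input" conversion the comments describe. It is not the Windows API: `utf8_to_utf16` is a hypothetical helper name, and on Windows the same role would normally be played by `MultiByteToWideChar` with `CP_UTF8` and the `MB_ERR_INVALID_CHARS` flag. The sketch assumes NUL-terminated input and a caller-supplied buffer, and it also shows why UTF-16 is not a fixed-width encoding: code points above U+FFFF come out as two code units (a surrogate pair).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into UTF-16 code units.
 * Returns the number of code units written (excluding the terminator),
 * or -1 if the input is not valid UTF-8 or the buffer is too small. */
long utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s) {
        uint32_t cp;    /* code point being assembled */
        uint32_t min;   /* smallest code point allowed for this length */
        int extra;      /* number of continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; min = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else return -1; /* stray continuation byte or invalid lead byte */
        s++;

        for (; extra > 0; extra--, s++) {
            /* A NUL terminator here also fails this test (truncated input). */
            if ((*s & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (*s & 0x3F);
        }

        /* Reject overlong forms, encoded surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (n + 1 >= cap) return -1;    /* keep room for the terminator */
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;                  /* split into a surrogate pair */
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return -1;
    dst[n] = 0;
    return (long)n;
}
```

Called on the four bytes `F0 9F 98 80` (U+1F600), the function writes the two code units `0xD83D 0xDE00`; called on the overlong sequence `C0 80`, it returns -1 rather than producing output. Returning an error instead of substituting a replacement character is a design choice that matches the "fail early" position in the thread; an application that must preserve malformed input would keep the original bytes alongside any converted form.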
