
Visual Studio tries to insist on using TCHARs, which, when compiled with the UNICODE option, basically means using the wide (UTF-16) versions of the Windows and other APIs.

Is there then any danger to using UTF-8 internally in the application (which makes using the C++ STL easier and also enables more readable cross-platform code) and then only converting to UTF-16 when you need to use any of the OS APIs?

I'm specifically asking about developing for more than one OS - Windows, which doesn't use UTF-8, and others like Mac, that do.
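To make it concrete, something like this minimal sketch is what I have in mind (the `widen` helper is just an illustrative name; error handling omitted):

#include <string>
#include <windows.h>

// Keep UTF-8 in std::string throughout the application; convert to UTF-16
// only at the WinAPI boundary.
std::wstring widen(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

// e.g. CreateFileW(widen(path).c_str(), ...) on Windows; the plain UTF-8
// path goes straight to open() on Mac/Linux.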

Carl
  • Just throwing this out there, it makes more sense to just use the OS's default wide encoding in the application and re-encode as UTF-8 if you're transferring data to another machine, unless you're using an API that requires UTF-8 strings. – Collin Dauphinee Mar 08 '12 at 19:56
  • I'm specifically asking about the case where you're developing for 2 OSes: Mac & Win, for example. Mac uses UTF-8 and Windows does not. – Carl Mar 08 '12 at 19:59
  • How does UTF-8 make use of the C++ STL easier? – ravenspoint Mar 08 '12 at 20:01
  • I don't do OS X development, but I thought their strings were UCS-2 encoded? Either way, I would try to make the code work in the platform's native encoding through typedefs or defines. – Collin Dauphinee Mar 08 '12 at 20:02
  • @ravenspoint: I have come across problems on Visual Studio 2005 where trying to use the wide version of the STL does not produce the expected results. Not sure if this is still an issue in VS2010 and the upcoming VS2011, but I'd rather stay away from obscure bugs if I can help it :) – Carl Mar 08 '12 at 20:07
  • @ravenspoint: Because the Standard C and C++ libraries treat `char*` as the default string type and `wchar_t*` as an afterthought. – dan04 Mar 08 '12 at 20:10
  • @dauphic: Depends on which API you use. Carbon and Cocoa (the preferred APIs for application development) do use UTF-16, but the POSIX APIs use UTF-8. – Philipp Mar 08 '12 at 20:44
  • @carleeto: What do you mean by "the wide version of the STL"? The STL is a set of templates which are by definition independent of any concrete type. – Philipp Mar 08 '12 at 20:47
  • @Philipp: Technically, the POSIX APIs use a locale-specified encoding (in Windows terminology, "ANSI") like they've always done. It's just that modern Unix-like OSes have switched to using UTF-8 by default. – dan04 Mar 09 '12 at 01:49
  • @dauphic I am not sure it makes more sense: in terms of performance, OS API calls with textual parameters are mostly UI and file system, and both are outweighed by common machine-to-machine channels like communication (e.g. TCP) and file IO in terms of the characteristic number of characters. Therefore, I believe it is better to have all strings UTF-8 by default, and not the OS-native encoding. – Pavel Radzivilovsky Jun 07 '12 at 13:55

7 Answers

2

As others have said, there is no danger to using UTF-8 internally, and then converting when you need to call Windows functions.

However, be aware that the cost of converting every time you do so might become prohibitively expensive if you're displaying a lot of text. (Remember, you don't just have the conversion itself, but also the cost of allocating and freeing the buffers that hold the temporary, converted strings.)
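If profiling shows the conversions matter, one mitigation (a rough sketch with a hypothetical helper name) is to convert into a caller-supplied buffer, so that a loop displaying many strings reuses a single allocation:

#include <string>
#include <windows.h>

// Hypothetical helper: writes the UTF-16 result into 'out', letting the caller
// reuse the same std::wstring (and its capacity) across many conversions.
void widen_into(const std::string& utf8, std::wstring& out)
{
    out.clear();
    if (utf8.empty()) return;
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    out.resize(len);
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &out[0], len);
}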

I should also point out that there is wide-character support built into the STL (std::wstring, et al.), so there's really no reason for doing this.

Additionally, working exclusively with UTF-8 is fine for English, but if you plan on supporting Eastern European, Arabic, or Asian character sets, your storage requirements for text might turn out to be larger than for UTF-16 (because many such characters need three bytes in UTF-8 where UTF-16 needs only two). Again, this will probably only be an issue if you're dealing with large volumes of text, but it's something to consider - doubly so if you're going to be transferring this text over a network connection at any time.

  • I am also adding this link as a comment, because it contains a lot of relevant information: http://utf8everywhere.org/ – Carl May 08 '12 at 04:40
1

No, there are no dangers if you follow the guidelines.[1] In fact it's the sanest and simplest way to go,[2] even if you write for Windows only.

And note that UTF-8 is never any longer than UTF-16 for European languages, nor for non-BMP characters. It takes more space only for codepoints encoded with 3 bytes in UTF-8 and 2 in UTF-16, which is precisely the U+0800 to U+FFFF range,[3] which is mostly CJK characters.
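For a quick illustration (byte counts exclude the terminating null):

#include <iostream>

int main()
{
    const char a[]  = "a";              // U+0061: 1 byte in UTF-8 vs. 2 in UTF-16
    const char e[]  = "\xC3\xA9";       // U+00E9 'é': 2 bytes in UTF-8 vs. 2 in UTF-16
    const char eu[] = "\xE2\x82\xAC";   // U+20AC '€': 3 bytes in UTF-8 vs. 2 in UTF-16
    std::cout << sizeof(a) - 1 << ' ' << sizeof(e) - 1 << ' ' << sizeof(eu) - 1 << '\n'; // 1 2 3
}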

Yakov Galka
1

Since UTF-8 and UTF-16 are merely two ways of encoding numbers (which are then interpreted as so-called code points), there is nothing wrong with converting back and forth: no information is lost. So no, there's no danger in converting (as long as the conversion is correct, of course).
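Here is a sketch of such a lossless round trip in portable C++ (std::wstring_convert is deprecated since C++17, but it illustrates the point):

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::string    utf8  = "\xE2\x82\xAC gr\xC3\xBC\xC3\x9F";  // "€ grüß" in UTF-8
    std::u16string utf16 = conv.from_bytes(utf8);              // UTF-8  -> UTF-16
    std::string    back  = conv.to_bytes(utf16);               // UTF-16 -> UTF-8
    assert(back == utf8);                                      // nothing was lost
}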

DarkDust
  • Of course there is no risk in any software as long as it is correct. :) There is a risk in these conversions because, for one thing, they are going to involve memory management. Take the UTF-8, allocate a buffer long enough to hold the UTF-16, etc. It's inefficient too. If we have a `std::wstring x;` we can just do `x.c_str()` to get the `const wchar_t *` string which we can pass to an OS API function directly. – Kaz Mar 08 '12 at 20:59
1

I'm assuming your project is not about text processing, manipulation or transformation: for text processing, it is far easier to choose one and only one encoding, the same on all platforms, and then do the conversion, if needed, when using the native API.

But if your project is not centered around text processing/manipulation/transformation, then restricting yourself to UTF-8 on all platforms is not the simplest solution.

Avoid using char on Windows

If you work with the char type in Windows development, all the WinAPI calls you make will use the char (ANSI) versions.

The problem is that the char type on Windows exists for "historical" applications, meaning pre-Unicode applications.

Every char text is interpreted as non-Unicode text whose encoding/charset is chosen by the Windows user, not by you, the developer.

Meaning: if you believe you're working with UTF-8, send that UTF-8 char text to the WinAPI to output in the GUI (a TextBox, etc.), and then run your code on a Windows set up for Arabic (for example), you'll see that your pretty UTF-8 char text is not handled correctly, because the WinAPI on that Windows believes all char text is to be interpreted as Windows-1256.
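To illustrate (a rough sketch; the narrow A function goes through the user's ANSI code page, the wide W function does not):

#include <windows.h>

int main()
{
    const char* utf8 = "\xC3\xA9t\xC3\xA9";   // "été" encoded as UTF-8
    // Wrong: MessageBoxA reinterprets these bytes in the user's ANSI code page
    // (Windows-1252, Windows-1256, ...), so the user sees mojibake.
    MessageBoxA(NULL, utf8, "char / ANSI", MB_OK);
    // Right: convert to UTF-16 first, then call the wide API.
    wchar_t wide[16] = {};
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 16);
    MessageBoxW(NULL, wide, L"wchar_t / UTF-16", MB_OK);
    return 0;
}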

If you're working with char on Windows, you're forsaking Unicode unless every call to the WinAPI goes through a translation (usually through a framework like GTK+, Qt, etc., but it could be your own wrapper functions).

Premature optimization is the root of all evil, but converting all your UTF-8 text to and from UTF-16 every time you talk to Windows does seem to me to be quite a useless pessimization.

Alternative: Why not use TCHAR on all platforms?

What you should do is work with TCHAR, provide a header similar to tchar.h for Linux/MacOS/whatever (re-declaring the macros, etc. from the original tchar.h header), and augment it with a tchar.h-like header for the Standard Library objects you want to use. For example, my own tstring.hpp goes like:

// tstring.hpp
#include <string>
#include <sstream>
#include <fstream>
#include <iostream>

#ifdef _MSC_VER
#include <tchar.h>
#include <windows.h>
#else
#ifdef __GNUC__
#include <MyProject/tchar_linux.h>
#endif // __GNUC__
#endif


namespace std
{

#ifdef _MSC_VER

   // On Windows, the exact type of TCHAR depends on the UNICODE and
   // _UNICODE macros. So the following is useful to complete the
   // tchar.h headers with the C++ Standard Library's symbols.

   #ifdef UNICODE

      typedef              wstring        tstring ;
      // etc.
      static wostream &    tcout          = wcout ;

   #else // #ifdef UNICODE

      typedef              string         tstring ;
      // etc.
      static ostream &     tcout          = cout ;

   #endif // #ifdef UNICODE

#else // #ifdef _MSC_VER

    #ifdef __GNUC__

    // On Linux, char is expected to be UTF-8 encoded, so the
    // following simply maps the txxxxx type into the xxxxx
    // type, forsaking the wxxxxx altogether.
    // Of course, your mileage will vary, but the basic idea is
    // there.

    typedef                string         tstring ;
    // etc.
    static ostream &       tcout          = cout ;

    #endif // __GNUC__

#endif // #ifdef _MSC_VER

} // namespace std

Disclaimer: I know, it's evil to declare things in std, but I had other things to do than be pedantic on that particular subject.

Using those headers, you can use the C++ Standard Library combined with the TCHAR facility, that is, use std::tstring, which will be compiled as std::wstring on Windows (provided you compile with the UNICODE and _UNICODE macros defined) and as std::string on the other char-based OSes you want to support.
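For example, a trivial usage sketch (the _T macro comes from tchar.h on Windows and from the tchar.h-like header on the other platforms):

// main.cpp
#include "tstring.hpp"

int main()
{
    std::tstring name = _T("world") ;
    std::tcout << _T("Hello, ") << name << std::endl ;
    return 0 ;
}

This resolves to std::wstring/std::wcout on Windows and std::string/std::cout elsewhere.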

Thus, you'll be able to use the platform's native character type at no cost whatsoever.

As long as you are agnostic with your TCHAR character type, there won't be any problem.

And for the cases where you really want to deal with the dirty side of UTF-8 vs. UTF-16, you need to provide the conversion code (if needed), etc.

This is usually done by providing overloads of the same function for different types, and for each OS. This way, the right function is selected at compile time.

paercebal
  • The whole TCHAR workaround on non-Windows OSes is precisely what I'm trying to avoid, because to be honest, it's a shame that a poor choice on one OS should make code on another OS harder to understand. – Carl Mar 08 '12 at 20:43
  • @carleeto : You're wrong about the "poor choice". For Microsoft, the support of legacy applications is of the utmost importance. And before Unicode, localization was handled through charsets and encodings. When Windows went Unicode (around Win95/Win98), deciding that char would now be UTF-8 would have broken all the applications in all the world (including the non-Unicode 300-DLL/EXE app I'm working on right now). So they decided to let `char` use charsets, and use `wchar_t` for Unicode. Note that Java uses UTF-16, too... – paercebal Mar 08 '12 at 20:53
  • @carleeto: ... So, the use of TCHAR on Windows was a way to provide an easy transition from charsets to Unicode... – paercebal Mar 08 '12 at 20:59
  • And Windows and Java are not the only ones working with UTF-16. One person on this question mentioned OS X being on UTF-16, too. I know the XML DOM is standardized around UTF-16 strings. So why is Linux based on UTF-8? Linux had it easier because, unlike Windows, no one expects one Linux program to execute flawlessly across two or three decades of versions of Linux. No, what is expected instead is: "If you want your code to work on that OS version, then compile it for that OS version". So recycling `char` to make it UTF-8 was an easy choice, not a genius design. – paercebal Mar 08 '12 at 21:02
  • That's my point though. If you do need to support Unicode on Windows, you need to use UTF-16. You don't need to do so on other OSes. In fact, file names on other OSes like Mac and Linux are UTF-8. So the way I see it, convert to UTF-16 only when you need to on Windows and use standard types everywhere else. Hence my question. I agree that memory management and performance do come into play, but they are separate problems. – Carl Mar 08 '12 at 21:33
  • @carleeto : The way I see it, you're writing C++ code. While I understand how the idea of using a runtime conversion interface is seductive at first, you are actively designing your project around inefficient code on Windows (my 1st point in my answer), in a language that could easily support compile-time efficient code on all platforms (my 2nd point), just because you shy away from a TCHAR typedef. Those are not separate problems unless you really *know* now that the conversion will not happen enough to be a bottleneck. If you don't know that now, then it's your bet... :-) – paercebal Mar 08 '12 at 22:48
  • Actually, I'd say Windows had it easier because it's a single vendor OS. The Plan 9 developers had the constraint of API compatibility with other versions of Unix. – dan04 Mar 09 '12 at 02:14
  • @dan04: API compatibility is one thing, but *binary* compatibility, including across multiple OSes (DOS, Win9x, WinNT, 32/64-bit, etc.), is something of another magnitude IMHO. And I'm not even mentioning the hacks introduced in Windows to support buggy programs from the past in today's OS: Raymond Chen has an entire blog dedicated to that. – paercebal Mar 09 '12 at 09:33
  • @paercebal I see no harm in working under the _UNICODE define and still using char for Unicode-compatible text, because there will be no accidental way of feeding it into an ASCII-hungry API. As for the "hysterical raisins"... I fail to see any harm from supporting another MBCS-like locale named "CP_UTF8". I am sure Microsoft will do it one day, and life will be good. – Pavel Radzivilovsky Jun 07 '12 at 13:50
  • @Pavel Radzivilovsky : `I am sure microsoft will do it one day, and life will be good`... If you say so... :-) – paercebal Jun 07 '12 at 14:12
  • @paercebal : You're right. Conversion can be a bottleneck. In this specific application, it does not happen often enough. The proof is that we are currently using UTF-16 internally even on Mac and converting to UTF-8 when we need to. One additional benefit I see from using UTF-8 internally is that the Xcode debugger would actually be more helpful when looking at strings. From a user point of view, they don't care what the internal encoding is as long as everything works. Therefore, I feel that this would actually be a good place for a design decision that increases productivity. – Carl Jun 07 '12 at 21:56
  • @paercebal Thanks, I wish you would sign under utf8everywhere.org :) – Pavel Radzivilovsky Jun 14 '12 at 16:36
  • @Pavel Radzivilovsky : `I wish you would sign under utf8everywhere.org :)` : Ahem... I have nothing against UTF-8. My personal computer is a Linux on UTF-8... I love fantasy, I love role-playing and video games... But seriously, I believe there's a better chance Bioware will do its homework and produce new and correct endings for Mass Effect 3 than Microsoft changing its char encoding on Windows (perhaps WinRT will offer UTF-8 chars, at the cost of UTF-8/UTF-16 conversion between Windows and the application, but the WinAPI itself, not a chance)... We'll see how Bioware fares with ME3... :-) – paercebal Jun 14 '12 at 18:35
  • The question of whether the WinAPI supports it or not is orthogonal to the question. Also, supporting UTF-8 for Windows is not much more difficult than supporting the existing multi-byte ANSI locales. – Pavel Radzivilovsky Jun 21 '12 at 23:49
  • @PavelRadzivilovsky: right, but let's face it, they don't want to. It's vendor lock-in, which Microsoft is famous for. Perhaps someone should reverse engineer the `more` command to understand whether it deliberately doesn't support the UTF-8 code page, and sue Microsoft if it is :) – Yakov Galka Jun 22 '12 at 10:23
  • @ Pavel Radzivilovsky, @ybungalobill : `supporting UTF-8 for windows is not much more difficult than supporting the existing multi-byte ANSI locales.` Not exactly. When a client has a Japanese Windows and wants to use our `char`-based application, they want it to work. If the next Windows decides to use UTF-8 by default for `char`, my Japanese clients will have a bad surprise when upgrading, as our app won't work anymore. And Microsoft will be blamed for that. So, no, they won't support UTF-8 by default, not because of lock-in, but because they have decades of code still running on Windows... – paercebal Jun 22 '12 at 19:24
  • @paercebal: 1) we are not talking about 'default' currently, even an 'optional' UTF-8 like in Linuxes would be a great achievement. 2) Even if it will be by default, how "our app won't work anymore", exactly? 3) ANSI API is mostly deprecated for a long time already. – Yakov Galka Jun 22 '12 at 19:37
  • @ybungalobill : `ANSI API is mostly deprecated for a long time already` If you say so. You'll have to convince my customers. . . `even an 'optional' UTF-8 like in Linuxes would be a great achievement` No, because no one would use it. When my mother installs her Windows, configured for French, she does not want to deal with encodings, only language. . . `Even if it will be by default, how "our app won't work anymore", exactly?` Try it. Compile an app with multibyte-char Japanese text, and launch it on any non-Japanese-configured Windows, and see if you get your hiragana characters. – paercebal Jun 23 '12 at 10:13
  • @paercebal This is a valid point; one is often forced to change the master locale for a certain application to work. I had to do it myself... to two alternating ANSI locales, one at a time. He would have to "undo" the default UTF-8 in this case... wait a sec.. why is there a char-based application under your supervision? Anyway, why is it not easy for MS to solve? You just provide a SetAnsiCP(UINT codepage) API for your app to run during initialization to be able to relax your customer in an emergency manner, without rewriting to Unicode. – Pavel Radzivilovsky Jul 16 '12 at 06:13
  • @Pavel Radzivilovsky : `why is there a char-based application under your supervision?` : Because there is a difference between my priorities and the whole team (including high managers') priorities. `You just provide a SetAnsiCP(UINT codepage)` : I fail to see how your solution would help me at all. What would be the effect of this function? – paercebal Jul 16 '12 at 08:46
  • @paercebal The effect I suggest would be setting how xxxA() APIs treat the strings supplied as parameters. This way, you would restore the compatibility regardless of OS default locale setting for your application that works with Japanese customers and makes use of a specific ANSI CP. This way you would change a little. An alternative way, of course, to handle this without hurting priorities is to instruct the user to set the global locale accordingly.. – Pavel Radzivilovsky Jul 18 '12 at 06:31
  • @paercebal I would also like to add, that I have recently converted a very large char-based windows application to a char application where char means UTF-8 (with nowide). It was not very hard. It could have almost been done automatically. In my estimate, it is also easier than converting all strings to WCHAR/TCHAR, simply because many of internal (computer-to-computer protocol) strings stayed intact. – Pavel Radzivilovsky Jul 18 '12 at 06:33
  • @Pavel Radzivilovsky : `The effect I suggest would be setting how xxxA() APIs` I don't see how I can do that without overloading the WinAPI (hiding the xxxA API with my own). You should offer a compilable example on `utf8everywhere.org` . . . . `I would also like to add, that I have recently converted a very large char-based windows application` : My own application consists of 363 binaries, both libraries and executables. We have millions of lines of code. The binaries communicate with each other (interprocess). No matter what we do, it'll cost time and resources. – paercebal Jul 18 '12 at 13:28
  • [1] Of course I cannot, we are discussing a change (addition) in the API. It is currently not possible. Internally, there is a state variable which these functions respect. [2] That's quite a lot, it is bigger than the one I converted (I had less than a million CL)... But my question to you is - if you had the liberty to do anything, how would you proceed? – Pavel Radzivilovsky Jul 23 '12 at 06:34
  • @Pavel Radzivilovsky : `if you had the liberty to do anything, how would you proceed?` : I'd use `TCHAR`s. . . ^_^ . . . – paercebal Jul 23 '12 at 08:38
1

If you have an OS that takes wide(r) characters in its API, and you're writing an application that requires internationalization support, it is completely silly to be using char and UTF-8 as an internal representation in your program. You're using UTF-8 backwards. UTF-8 is for smuggling Unicode through operating system interfaces, and through storage and data interchange formats, which cannot handle wide characters directly.

Kaz
  • The OP specifically asked about cross-platform development. And it's *not* silly to use UTF-8 on Linux. – dan04 Mar 08 '12 at 22:12
  • I see; but I didn't say it is always silly to use UTF-8 (even if the underlying system API's are ASCII). (I don't use UTF-8 on Linux, internally, by the way. I do believe it is silly but in a different context, for different reasons, which I didn't assert here, until now. UTF-8 is a bad choice of internal representation for Unicode strings. If `str[42]` doesn't get you the 43rd character of the string, it's a bit of a nonstarter. IMO, of course. YMMV. You can get away with it in programs that don't really manipulate text but just pass it through, or do only some light text processing.) – Kaz Mar 08 '12 at 22:24
  • Cross platform development usually means writing code that compiles and runs on different platforms. UTF8 is of no particular help one way or the other. Sending data between platforms does require UTF8, but that is not usually meant by the term. – ravenspoint Mar 08 '12 at 22:36
  • Kaz, there are no guarantees that str[42] will give you the 43rd character with UTF-16 either. Not even with UTF-32, because the code points may include modifiers. Essentially the only way to obtain the 43rd character in any Unicode encoding scheme is to parse the text. (Or call some API function that parses it for you.) – Mar 09 '12 at 17:06
0

The "danger" is that UTF-8 character count is not the same as ASCII character count. E.g., U+24B62 is a single Unicode character but expands to 4 UTF-8 bytes. (See here for other examples.)

If you don't use the two interchangeably, you will be fine.
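For instance, a quick sketch of the mismatch:

#include <cstring>
#include <iostream>

int main()
{
    const char* s = "\xF0\xA4\xAD\xA2";   // U+24B62 as its 4-byte UTF-8 sequence
    std::cout << std::strlen(s) << " bytes, but only 1 code point\n";   // prints 4
}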

MSN
  • I'm aware of the fact that the character to byte count is not necessarily a 1 to 1 mapping in UTF-8. It still makes it easier to use the STL with UTF-8 simply because I'm using the char type internally. Just wondering if anyone has gone down this road before and if they did, what problems they faced. – Carl Mar 08 '12 at 20:01
  • That's also the case for UTF-16. And for the record, UTF-32 has combining characters. So no matter the encoding, if you want to handle text (as opposed to storing and passing binary data around) you have to know what you're doing (and/or delegate to an API you're familiar with). – Luc Danton Mar 08 '12 at 20:10
  • Guys, who ever cares about the number of characters (code points)? It has nothing to do with UI string length. It has nothing to do with string size in memory (strlen). Why does it matter for anything? – Pavel Radzivilovsky Jun 07 '12 at 13:42
  • @PavelRadzivilovsky, it matters because size!=length*constant is confusing for people initially. It's not information you can derive just by looking at the code. – MSN Jun 07 '12 at 15:42
  • But what _is_ size? What is character count in Unicode? Some of the Unicode glyphs are really complicated; in Hebrew you can have 4 code points per character. How can one be confused if he does not do anything with the number of code points? – Pavel Radzivilovsky Jun 14 '12 at 16:33
-1

UTF-8 is a wild and wacky way of representing characters. You should avoid using it wherever possible. The Windows API avoids UTF-8. (If you insist on a 'multibyte' build, rather than a 'unicode' build, it will do all the conversions for you, under the covers, so it can continue to use UTF-16 - and if you are not careful the inefficiency of all those hidden conversions will eat you up.) The wxWidgets library avoids UTF-8 in the same way, and it is cross-platform with Macs.

You should take a hint from this, and avoid UTF-8 yourself.

When do you need to use UTF-8? The snag with UTF-16 is that it depends on the byte order of the 16-bit words, which varies with the hardware. So when you transfer data between different computers, which might use a different byte order in their hardware, you have to use UTF-8, which has the same byte order on any hardware. This is why browsers and WWW pages use UTF-8.
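A small sketch of that byte-order dependence (the printed order flips with the hardware's endianness, while the UTF-8 encoding of the same character is byte-identical everywhere):

#include <cstdio>

int main()
{
    char16_t euro = u'\u20AC';              // one UTF-16 code unit
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&euro);
    std::printf("%02X %02X\n", p[0], p[1]); // AC 20 on little-endian, 20 AC on big-endian
}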

ravenspoint
  • UTF-16 is the wacky one. You're using wide characters that have byte order issues (can't send it over the network directly, etc), AND you still have to encode some characters. Then there are those "surrogate pairs", ugh. UTF-8 is clean compared to UTF-16. It's ugly to use UTF-8 for internal representation, except in programs that only do trivial text manipulations (e.g. parsing a path into components around slashes) or none at all (transparently passing through text). – Kaz Mar 09 '12 at 00:31
  • Windows should not be regarded as supporting UTF-16, but as supporting Unicode limited to the Basic Multilingual Plane. UTF-16 is basically a lie. Any system that handles strings of 16-bit characters can play the "but we support full Unicode through UTF-16" card. That's like saying that a Commodore 64 supports Unicode because it can manipulate bytes which can hold UTF-8. Show me that the *character type* holds Unicode code points. Characters are a data type for representing text just as important as strings. – Kaz Mar 09 '12 at 00:39
  • Windows "avoids" UTF-8 encoding not because it is bad, but because ~30 years ago Windows designers made a decision to use UCS-2 for Windows API. After releasing Unicode 3.1 in 2001 the Windows developers switched to UTF-16 to support new characters along with keeping compatibility with existing Windows software. – Evgeni Nabokov Feb 28 '23 at 00:00