I'm assuming your project is not about text processing, manipulation or transformation: for text processing, it is far easier to choose one and only one encoding, the same on all platforms, and then do the conversion when calling the native API if needed.
But if your project is not centered around text processing/manipulation/transformation, then restricting yourself to UTF-8 on all platforms is not the simplest solution.
Avoid using char on Windows
If you work with the char type on Windows, then all the WinAPI will use char.
The problem is that the char type on Windows is reserved for "historical" applications, meaning pre-Unicode applications.
Every char text is interpreted as non-Unicode text whose encoding/charset is chosen by the Windows user, not by you, the developer.
Meaning: if you believe you're working with UTF-8, send that UTF-8 char text to the WinAPI to output in the GUI (a TextBox, etc.), and then execute your code on a Windows set up in Arabic (for example), you'll see your pretty UTF-8 char text won't be handled correctly, because the WinAPI on that Windows believes all char text is to be interpreted as Windows-1256.
If you're working with char on Windows, you're forsaking Unicode unless every call to the WinAPI goes through a translation (usually through a framework like GTK+, Qt, etc., but it could be your own wrapper functions, as sketched below).
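For illustration, here is a minimal sketch of such a wrapper, using the WinAPI function MultiByteToWideChar. The function name utf8_to_wide is my own choice, and error handling is omitted:

// utf8_to_wide.hpp
#include <string>
#include <windows.h>

// Converts a UTF-8 encoded std::string into a UTF-16 std::wstring,
// suitable for the "W" flavour of the WinAPI functions.
inline std::wstring utf8_to_wide(const std::string & utf8)
{
    if (utf8.empty()) return std::wstring() ;

    // First call: ask how many wide characters are needed.
    const int size = ::MultiByteToWideChar(CP_UTF8, 0,
                                           utf8.data(), static_cast<int>(utf8.size()),
                                           NULL, 0) ;

    std::wstring wide(size, L'\0') ;

    // Second call: do the actual conversion.
    ::MultiByteToWideChar(CP_UTF8, 0,
                          utf8.data(), static_cast<int>(utf8.size()),
                          &wide[0], size) ;

    return wide ;
}

// Usage (hypothetical): ::SetWindowTextW(hwnd, utf8_to_wide(myUtf8Text).c_str()) ;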
Premature optimization is the root of all evil, but converting all your UTF-8 texts from and to UTF-16 each time you talk to Windows does seem to me to be quite a useless pessimization.
Alternative: Why not use TCHAR on all platforms?
What you should do is work with TCHAR: provide a header similar to tchar.h for Linux/MacOS/whatever (redeclaring the macros, etc. from the original tchar.h header; a sketch of such a header is given after the example below), and augment it with tchar.h-like headers for the Standard Library objects you want to use. For example, my own tstring.hpp goes like:
// tstring.hpp
#include <string>
#include <sstream>
#include <fstream>
#include <iostream>
#ifdef _MSC_VER
#include <tchar.h>
#include <windows.h>
#else
#ifdef __GNUC__
#include <MyProject/tchar_linux.h>
#endif // __GNUC__
#endif
namespace std
{
#ifdef _MSC_VER
// On Windows, the exact type of TCHAR depends on the UNICODE and
// _UNICODE macros. So the following is useful to complete the
// tchar.h headers with the C++ Standard Library's symbols.
#ifdef UNICODE
typedef wstring tstring ;
// etc.
static wostream & tcout = wcout ;
#else // #ifdef UNICODE
typedef string tstring ;
// etc.
static ostream & tcout = cout ;
#endif // #ifdef UNICODE
#else // #ifdef _MSC_VER
#ifdef __GNUC__
// On Linux, char is expected to be UTF-8 encoded, so the
// following simply maps the txxxxx type into the xxxxx
// type, forsaking the wxxxxx types altogether.
// Of course, your mileage will vary, but the basic idea is
// there.
typedef string tstring ;
// etc.
static ostream & tcout = cout ;
#endif // __GNUC__
#endif // #ifdef _MSC_VER
} // namespace std
Disclaimer: I know, it's evil to declare things in std, but I had other things to do than be pedantic on that particular subject.
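As for the tchar_linux.h header mentioned above, a minimal sketch could look like the following. It mirrors only a small subset of Microsoft's tchar.h; the exact macro list is up to you, so extend it with whatever your code actually uses:

// MyProject/tchar_linux.h
#ifndef MYPROJECT_TCHAR_LINUX_H
#define MYPROJECT_TCHAR_LINUX_H

#include <string.h>
#include <stdio.h>

// On char-based platforms, TCHAR is simply char, and the _T()/TEXT()
// literal macros are no-ops.
typedef char TCHAR ;
#define _T(x)      x
#define TEXT(x)    x

// Map the _t-prefixed C functions onto their plain char counterparts.
#define _tcslen    strlen
#define _tcscmp    strcmp
#define _tcscpy    strcpy
#define _tprintf   printf
#define _tmain     main
// etc.: add the other _txxxxx macros you need.

#endif // MYPROJECT_TCHAR_LINUX_H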
Using those headers, you can use the C++ Standard Library combined with the TCHAR facility: that is, use std::tstring, which will be compiled as std::wstring on Windows (provided you compile with the UNICODE and _UNICODE defines) and as std::string on the other char-based OSes you want to support.
Thus, you'll be able to use the platform's native character type at no cost whatsoever.
As long as you stay agnostic about your TCHAR character type, there won't be any problem.
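For example (a hypothetical snippet, assuming tstring.hpp is on your include path; the _T() macro comes from tchar.h on Windows and from the tchar_linux.h sketch above elsewhere):

// main.cpp
#include "tstring.hpp"

int main()
{
    // std::wstring + std::wcout on Windows (UNICODE builds),
    // std::string + std::cout on char-based platforms.
    std::tstring message = _T("Hello, world!") ;
    std::tcout << message << std::endl ;
    return 0 ;
}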
And for the cases where you really want to deal with the dirty side of UTF-8 vs. UTF-16, you need to provide the conversion code (if needed), etc.
This is usually done by providing overloads of the same function for different types, and for each OS, so that the right overload is selected at compile time.
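A minimal sketch of that idea, reusing the utf8_to_wide wrapper sketched earlier (the name to_tstring is my own, not a standard facility):

// tconvert.hpp
#include "tstring.hpp"

#if defined(_MSC_VER) && defined(UNICODE)

#include "utf8_to_wide.hpp"   // the wrapper sketched earlier in this answer

// Windows UNICODE build: narrow text is assumed to be UTF-8 and is
// converted to UTF-16; wide text passes through unchanged.
inline std::tstring to_tstring(const std::string & utf8)
{
    return utf8_to_wide(utf8) ;
}

inline std::tstring to_tstring(const std::wstring & wide)
{
    return wide ;
}

#else

// char-based platforms (and non-UNICODE Windows builds): tstring is
// already a narrow string, so the "conversion" is a pass-through.
// (A wstring overload here would need a real wide-to-UTF-8 conversion,
// left out of this sketch.)
inline std::tstring to_tstring(const std::string & text)
{
    return text ;
}

#endif

Calling code then simply writes to_tstring(someText) and gets the right behaviour for the current platform and build.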