Is there a cross-platform way to convert from UTF-8 to Latin/Arabic and from Latin/Arabic to UTF-8 in C++?
-
ANSI/Windows-1252 is not able to represent all the characters UTF-8 can. So no, you can't reliably convert, unless you define some escape mechanism extending 1252. – Erik May 10 '11 at 12:32
-
@Erik: Even worse, "ANSI" in this context is a weasel word for "whatever encoding is set at this moment" - it's not even guaranteed to be Windows-1252. – Piskvor left the building May 10 '11 at 12:35
-
I care about a specific set of characters (http://en.wikipedia.org/wiki/ISO/IEC_8859-6) – Abdelwahed May 10 '11 at 12:35
-
@Abdelwahed: Hmm, it should be possible to convert this to UTF-8: just make a table mapping the (single-byte) characters of ISO 8859-6 to (variable-length) UTF-8 characters. The opposite way is harder - you'll need to replace every character that's not in the mapping with some placeholder, e.g. `?`. – Piskvor left the building May 10 '11 at 12:38
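
A minimal sketch of that table idea, for the ISO 8859-6 to UTF-8 direction only. The function names are made up for illustration. It relies on the fact that every Arabic character in ISO 8859-6 sits at a fixed offset of 0x0560 from its Unicode code point, so the "table" collapses to a few range checks. Mapping unassigned bytes to U+FFFD is my assumption, not something the standard prescribes.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: map one ISO 8859-6 byte to a Unicode code point.
// Unassigned bytes yield U+FFFD (replacement character); that choice is an assumption.
inline std::uint32_t iso8859_6_to_codepoint(unsigned char b) {
    // ASCII, C1 controls, NBSP (0xA0), currency sign (0xA4), soft hyphen (0xAD)
    if (b <= 0xA0 || b == 0xA4 || b == 0xAD) return b;
    // Arabic comma (0xAC), semicolon (0xBB), question mark (0xBF)
    if (b == 0xAC || b == 0xBB || b == 0xBF) return b + 0x0560u;
    // Arabic letters U+0621..U+063A and U+0640..U+0652
    if ((b >= 0xC1 && b <= 0xDA) || (b >= 0xE0 && b <= 0xF2)) return b + 0x0560u;
    return 0xFFFDu;
}

// Append a code point as UTF-8. Only 1- to 3-byte forms are needed here,
// because nothing this sketch produces lies above U+FFFD.
inline void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a whole ISO 8859-6 byte string to UTF-8.
inline std::string iso8859_6_to_utf8(const std::string& in) {
    std::string out;
    for (char c : in) {
        append_utf8(out, iso8859_6_to_codepoint(static_cast<unsigned char>(c)));
    }
    return out;
}
```

The reverse direction would decode UTF-8 into code points first, subtract the same offset for the ranges above, and emit `?` for everything else.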
-
@Piskvor: There's really no need to code those table-lookups yourself. Although, of course, it would not be too hard. The inverse may be harder, even once you have UTF-8 split into logical characters: I haven't checked if 8859-6 has any characters which have more than one representation in Unicode, where you may have combining characters such as a+¨ for an ä (which also exists). – Christopher Creutzig May 10 '11 at 12:45
-
@Christopher Creutzig: You are correct, of course: no need to re-invent the wheel. Also, the existing libraries might normalize the character representations for you, as you noted. – Piskvor left the building May 10 '11 at 12:49
2 Answers
There are libraries like ICU available. But Erik is, of course, right: the round-trip from Unicode through ISO 8859-6 will be lossy. (Yes, UTF-8 is “Unicode.” UTF-16 is “Unicode,” too, just with different bit patterns for the same code points. See Joel Spolsky's text if you didn't know that. Or if you haven't read it yet, it's good material.)

-
ICU contains examples, e.g., in http://source.icu-project.org/repos/icu/icuapps/trunk/translitdemo/ – Christopher Creutzig May 12 '11 at 08:14
There is not, but there is a cross-platform way to convert between Unicode represented in `wchar_t` (which is 16-bit on Windows and 32-bit on most other platforms) and whatever is set as the locale character encoding in the system, using the `wcstombs`/`mbstowcs` routines from the standard C library or the `codecvt` facet of `locale` in the standard C++ library. The conversion between `wchar_t`, where each element is one code point, and UTF-8 is then quite simple. So you can write, or copy from somewhere, a routine to convert between UTF-8 and Unicode in `wchar_t` and combine it with `wcstombs`/`mbstowcs`.
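
The UTF-8 half of that could look roughly like the sketch below. `utf8_to_wide` is a made-up name, not a standard function. It assumes well-formed input apart from the basic checks shown (it does not reject overlong encodings or encoded surrogates), and it emits a surrogate pair when `wchar_t` is only 16 bits wide, which is my addition for the Windows case.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 string into wchar_t elements.
inline std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

The resulting string can then be handed to `wcstombs` after `setlocale` has selected the target encoding. Whether a locale using ISO 8859-6 is actually installed depends on the system, so that step is not as portable as the decoding itself.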
