How to convert from UTF-8 to Latin/Arabic and vice versa?

Question

Is there a cross-platform way to convert from UTF-8 to Latin/Arabic and from Latin/Arabic to UTF-8 in C++?

ANSI/Windows-1252 is not able to represent all the characters UTF-8 can. So no, you can't reliably convert, unless you define some escape mechanism extending 1252. — Erik, May 10 '11 at 12:32
@Erik: Even worse, "ANSI" in this context is a weasel word for "whatever encoding is set at this moment" - it's not even guaranteed to be Windows-1252. — Piskvor left the building, May 10 '11 at 12:35
I care about a specific set of characters (http://en.wikipedia.org/wiki/ISO/IEC_8859-6) — Abdelwahed, May 10 '11 at 12:35
@Abdelwahed: Hmm, it should be possible to convert this to UTF-8, just make a table mapping the (single-byte) characters of ISO 8859-6 to (variable-length) UTF-8 characters. The opposite way is harder - you'll need to replace every character that's not in the mapping with some placeholder, e.g. `?`. — Piskvor left the building, May 10 '11 at 12:38
@Piskvor: There's really no need to code those table-lookups yourself. Although, of course, it would not be too hard. The inverse may be harder, even once you have UTF-8 split into logical characters: I haven't checked if 8859-6 has any characters which have more than one representation in Unicode, where you may have combining characters such as a+¨ for an ä (which also exists). — Christopher Creutzig, May 10 '11 at 12:45
@Christopher Creutzig: You are correct, of course: no need to re-invent the wheel. Also, the existing libraries might normalize the character representations for you, as you noted. — Piskvor left the building, May 10 '11 at 12:49
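
The table lookup discussed in these comments might look roughly like the sketch below for the ISO 8859-6 to UTF-8 direction. The function names are made up for illustration, and the mapping is written from the published ISO 8859-6 layout (the Arabic block sits at byte value + 0x0560), so please check it against the official mapping table before relying on it. Unassigned bytes are turned into U+FFFD here, which is only one possible policy.

```cpp
#include <string>

// Hypothetical helper: map one ISO 8859-6 byte to a Unicode code point.
// Returns U+FFFD (REPLACEMENT CHARACTER) for bytes the standard leaves unassigned.
char32_t iso8859_6_to_codepoint(unsigned char b) {
    // ASCII, C1 controls, NBSP, currency sign and soft hyphen map to themselves.
    if (b <= 0xA0 || b == 0xA4 || b == 0xAD) return b;
    if (b == 0xAC || b == 0xBB || b == 0xBF ||   // Arabic comma, semicolon, question mark
        (b >= 0xC1 && b <= 0xDA) ||              // U+0621..U+063A
        (b >= 0xE0 && b <= 0xF2))                // U+0640..U+0652
        return 0x0560u + b;
    return 0xFFFD;
}

// Hypothetical helper: append the UTF-8 encoding of a code point.
// Only the 1- to 3-byte forms are handled, which covers everything produced above.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a whole ISO 8859-6 byte string to UTF-8.
std::string iso8859_6_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) append_utf8(out, iso8859_6_to_codepoint(b));
    return out;
}
```

For example, the byte 0xC7 (ALEF) becomes U+0627, which is encoded in UTF-8 as the two bytes 0xD8 0xA7.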

Christopher Creutzig · Accepted Answer · 2011-05-10T12:45:45.917

There are libraries like ICU available. But Erik is, of course, right: the round trip from Unicode through ISO 8859-6 will be lossy. (Yes, UTF-8 is “Unicode.” UTF-16 is “Unicode,” too; it just uses different bit patterns for the same code numbers. See Joel Spolsky's text if you didn't know that. Or if you haven't read it yet, it's good material.)

edited May 10 '11 at 12:45

answered May 10 '11 at 12:37


ICU contains examples, e.g., in http://source.icu-project.org/repos/icu/icuapps/trunk/translitdemo/ – Christopher Creutzig May 12 '11 at 08:14

Jan Hudec · Answer 2 · 2011-05-10T13:11:10.477

There is not, but there is a cross-platform way to convert between Unicode represented in wchar_t (which is 16-bit on Windows and 32-bit on most other platforms) and whatever is set as the locale character encoding in the system, using the wcstombs/mbstowcs routines from the standard C library or the codecvt facet of locale in the standard C++ library. The conversion between wchar_t and UTF-8 is then quite simple when each element is one code point. That holds for 32-bit wchar_t; with 16-bit wchar_t, characters outside the Basic Multilingual Plane take two elements (a surrogate pair). So you can write, or copy from somewhere, a routine to convert between UTF-8 and Unicode in wchar_t and combine it with wcstombs/mbstowcs. Note that this only yields ISO 8859-6 if the active locale actually uses that encoding.
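
As a rough illustration of the "routine to convert between UTF-8 and Unicode" step, here is a minimal sketch that decodes UTF-8 into code points and then maps them straight to ISO 8859-6. It deliberately swaps two things relative to the answer: it uses char32_t instead of wchar_t (to sidestep the 16-bit case) and a hand-written mapping instead of wcstombs (so it does not depend on the active locale). All function names are hypothetical, the decoder does only minimal validation (it does not reject overlong forms or surrogates), and anything that has no ISO 8859-6 equivalent becomes `?`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at in[i] and advance i.
// Returns U+FFFD for malformed input. Minimal validation only.
char32_t next_codepoint(const std::string& in, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(in[i++]);
    if (lead < 0x80) return lead;
    int extra;
    char32_t cp;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    for (; extra > 0; --extra) {
        if (i >= in.size()) return 0xFFFD;  // truncated sequence
        unsigned char c = static_cast<unsigned char>(in[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;  // not consumed; decoding resumes here
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}

// Hypothetical helper: map a code point to its ISO 8859-6 byte,
// or '?' when the character is not representable (the lossy part).
char codepoint_to_iso8859_6(char32_t cp) {
    if (cp <= 0xA0 || cp == 0xA4 || cp == 0xAD) return static_cast<char>(cp);
    if (cp == 0x060C || cp == 0x061B || cp == 0x061F ||
        (cp >= 0x0621 && cp <= 0x063A) ||
        (cp >= 0x0640 && cp <= 0x0652))
        return static_cast<char>(cp - 0x0560);
    return '?';
}

// Convert a UTF-8 string to ISO 8859-6, replacing unmappable characters with '?'.
std::string utf8_to_iso8859_6(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); )
        out += codepoint_to_iso8859_6(next_codepoint(in, i));
    return out;
}
```

With this sketch, the UTF-8 bytes 0xD8 0xA7 (ALEF, U+0627) come out as the single byte 0xC7, while something like the euro sign comes out as `?`. Combining characters and other normalization issues are not handled; a library such as ICU is the safer choice for that.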
