4

Is there a cross-platform way to convert from UTF-8 to Latin/Arabic and from Latin/Arabic to UTF-8 in C++?

unwind
Abdelwahed
  • ANSI/Windows-1252 is not able to represent all the characters UTF-8 can. So no, you can't reliably convert, unless you define some escape mechanism extending 1252. – Erik May 10 '11 at 12:32
  • @Erik: Even worse, "ANSI" in this context is a weasel word for "whatever encoding is set at this moment" - it's not even guaranteed to be Windows-1252. – Piskvor left the building May 10 '11 at 12:35
  • I care about a specific set of characters (http://en.wikipedia.org/wiki/ISO/IEC_8859-6) – Abdelwahed May 10 '11 at 12:35
  • @Abdelwahed: Hmm, it should be possible to convert this to UTF-8, just make a table mapping the (single-byte) characters of ISO 8859-6 to (variable-length) UTF-8 characters. The opposite way is harder - you'll need to replace every character that's not in the mapping with some placeholder, e.g. `?`. – Piskvor left the building May 10 '11 at 12:38
  • @Piskvor: There's really no need to code those table-lookups yourself. Although, of course, it would not be too hard. The inverse may be harder, even once you have UTF-8 split into logical characters: I haven't checked if 8859-6 has any characters which have more than one representation in Unicode, where you may have combining characters such as a+¨ for an ä (which also exists). – Christopher Creutzig May 10 '11 at 12:45
  • @Christopher Creutzig: You are correct, of course: no need to re-invent the wheel. Also, the existing libraries might normalize the character representations for you, as you noted. – Piskvor left the building May 10 '11 at 12:49
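To illustrate the table-mapping idea from the comments, here is a minimal sketch in plain C++ with no library dependency. The function names are made up for this example. It assumes the usual ISO 8859-6 layout, where 0x00-0xA0, 0xA4 and 0xAD map to the same code point and every assigned Arabic byte maps to the byte value plus 0x0560. Unassigned bytes become U+FFFD, and code points outside the charset become `?` on the way back. It does not deal with normalization or combining sequences.

```cpp
#include <cassert>
#include <string>

// Map one ISO 8859-6 byte to a Unicode code point (U+FFFD if the byte is unassigned).
char32_t latin_arabic_to_codepoint(unsigned char b) {
    // ASCII, C1 controls, NBSP, currency sign and soft hyphen keep their value.
    if (b <= 0xA0 || b == 0xA4 || b == 0xAD) return b;
    // Arabic comma, semicolon, question mark, letters and marks: byte + 0x0560.
    if (b == 0xAC || b == 0xBB || b == 0xBF ||
        (b >= 0xC1 && b <= 0xDA) || (b >= 0xE0 && b <= 0xF2))
        return 0x0560u + b;
    return 0xFFFD;
}

// Append a code point as UTF-8; the mapping above never exceeds U+FFFF.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a whole ISO 8859-6 string to UTF-8.
std::string latin_arabic_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) append_utf8(out, latin_arabic_to_codepoint(b));
    return out;
}

// Reverse direction for a single code point; '?' is the placeholder
// for anything ISO 8859-6 cannot represent.
unsigned char codepoint_to_latin_arabic(char32_t cp) {
    if (cp <= 0xA0 || cp == 0xA4 || cp == 0xAD) return static_cast<unsigned char>(cp);
    if (cp >= 0x0560u + 0xA1 && cp <= 0x0560u + 0xFF) {
        unsigned char b = static_cast<unsigned char>(cp - 0x0560u);
        if (latin_arabic_to_codepoint(b) == cp) return b;
    }
    return '?';
}
```

The reverse function works on code points, so it still needs a UTF-8 decoder in front of it. An existing library would cover both steps.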

2 Answers

3

There are libraries like ICU available. But Erik is, of course, right: the round trip from Unicode through ISO 8859-6 will be lossy. (Yes, UTF-8 is “Unicode.” UTF-16 is “Unicode,” too, just with different bit patterns for the same code points. See Joel Spolsky's text if you didn't know that. Even if you did and haven't read it yet, it's good material.)

Christopher Creutzig
0

There is not, but there is a cross-platform way to convert between Unicode represented in wchar_t (which is 16-bit on Windows and 32-bit on most other platforms) and whatever is set as the locale character encoding in the system. You can use the wcstombs/mbstowcs routines from the standard C library or the codecvt facet of locale in the standard C++ library. The conversion between wchar_t and UTF-8 is then quite simple, as long as each wchar_t element holds one code point. That is true for 32-bit wchar_t. With 16-bit wchar_t it holds only for the Basic Multilingual Plane, which includes the Arabic letters. So you can write, or copy from somewhere, a routine to convert between UTF-8 and Unicode in wchar_t, and combine it with wcstombs/mbstowcs.
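A rough sketch of that combination, for the UTF-8 to locale-encoding direction only. `utf8_to_wide` and `wide_to_locale` are hypothetical names, not standard functions. The decoder replaces malformed sequences with U+FFFD, does not reject overlong encodings, and assumes every code point fits in one wchar_t. The second function relies on the common POSIX/MSVC behaviour of `wcstombs` returning the required length when the destination is null. Which encoding you actually get depends on the locale selected with `setlocale`, and the locale name for ISO 8859-6 is platform-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Decode UTF-8 into one wchar_t per code point (assumes the code point fits).
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i++]);
        unsigned long cp = 0;
        int extra = 0;                       // number of continuation bytes expected
        if (c < 0x80)                { cp = c; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else { out += static_cast<wchar_t>(0xFFFD); continue; }  // invalid lead byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out += static_cast<wchar_t>(ok ? cp : 0xFFFD);
    }
    return out;
}

// Convert to the multibyte encoding of the current C locale.
// Returns an empty string if some character cannot be represented.
std::string wide_to_locale(const std::wstring& in) {
    std::size_t n = std::wcstombs(nullptr, in.c_str(), 0);   // query required length
    if (n == static_cast<std::size_t>(-1)) return std::string();
    std::vector<char> buf(n + 1);
    std::size_t written = std::wcstombs(buf.data(), in.c_str(), buf.size());
    assert(written == n);
    return std::string(buf.data(), written);
}
```

Calling `wide_to_locale(utf8_to_wide(text))` after a suitable `setlocale(LC_CTYPE, ...)` gives the locale-encoded bytes. The opposite direction would use `mbstowcs` followed by a UTF-8 encoder.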

Jan Hudec