
Can you please help me convert a UCS-2 string to UTF-8 using ICU?

I'm using the following code, but it doesn't work.

UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
char buf[1000];
int32_t buflen;

conv = ucnv_open("utf-8", &status);

if (U_FAILURE(status))
{
    LOG(L_ERROR, "%s: Can not open the ICU converter\n", __FUNCTION__);
}
else
{
    buflen = ucnv_fromUChars(conv, buf, sizeof(buf), (UChar*)sms->message.s, sms->message.len, &status);

    if (U_FAILURE(status))
    {
        LOG(L_ERROR, "%s: Error in conversion: %s\n", __FUNCTION__, u_errorName(status));
    }
}

LOG(L_DEBUG, "%s: Conversion made ...\n", __FUNCTION__);
hexdump(sms->message.s, sms->message.len);
hexdump(buf, buflen);

sms->message is a struct:

typedef struct str
{
    char *s;
    int len;
} str_t;

The hexdump prints the following (input text: "aaaa"):

[DEBUG] add_recv_sms_to_db: Conversion made ...
000000: 00 61 00 61 00 61 00 61                          .a.a.a.a
000000: e6 84 80 e6 84 80 e6 84 80 e6 84 80 00 00 49 00  ..............I.
  • You may also want to tag your question with `c` or `c++` or whatever this is to make sure the right people see it. – deceze Jul 31 '14 at 09:31

2 Answers


e6 84 80 is UTF-8 for U+6100, a CJK unified ideograph. It looks like sms->message.s is big-endian UTF-16, while your (little-endian) system reads the UChar array in its native byte order, so 0x0061 is interpreted as 0x6100.

You can open a "UTF-16BE" converter and convert the raw bytes directly, or just perform a byte swap on the buffer before passing sms->message.s to ICU as UChar*.
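For example, an untested sketch of the converter route using ucnv_convert() (src, len, dst and dst_size stand in for sms->message.s, sms->message.len, buf and sizeof(buf) from your question):

#include <unicode/ucnv.h>

/* Untested sketch: convert len bytes of big-endian UTF-16 (your UCS-2 data)
 * in src directly to UTF-8 in dst. Returns the number of bytes written,
 * or -1 on error. */
static int32_t ucs2be_to_utf8(const char *src, int32_t len,
                              char *dst, int32_t dst_size)
{
    UErrorCode status = U_ZERO_ERROR;
    int32_t out = ucnv_convert("UTF-8", "UTF-16BE",
                               dst, dst_size, src, len, &status);
    return U_FAILURE(status) ? -1 : out;
}

You would call it as something like ucs2be_to_utf8(sms->message.s, sms->message.len, buf, sizeof(buf)), with no UChar* cast involved.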

ecatmur
  • You're right, I needed a byte swap. I'm using this function: `short swap_bytes_16(short input) { return (input>>8) | (input<<8); }` – user3894831 Aug 30 '14 at 10:52

I am not sure if it is linked to the endianness issue spotted by @ecatmur, but you are casting sms->message.s, which is a char*, into a UChar*.

Looking at the UChar documentation:

Define UChar to be UCHAR_TYPE, if that is #defined (for example, to char16_t), or wchar_t if that is 16 bits wide; always assumed to be unsigned.

If neither is available, then define UChar to be uint16_t.

This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.

Are you sure this cast is safe?
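If it isn't, one untested way to avoid the cast entirely is to let ICU fill a real UChar buffer from the raw bytes (the "UTF-16BE" name assumes the big-endian layout shown in the question's hexdump):

#include <unicode/ucnv.h>

/* Untested sketch: build a proper UChar buffer from the raw big-endian
 * bytes instead of casting char* to UChar*. Returns the number of UChars
 * written, or -1 on error. */
static int32_t bytes_to_uchars(const char *src, int32_t src_len,
                               UChar *dst, int32_t dst_capacity)
{
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("UTF-16BE", &status);
    int32_t n;

    if (U_FAILURE(status))
        return -1;

    n = ucnv_toUChars(cnv, dst, dst_capacity, src, src_len, &status);
    ucnv_close(cnv);

    return U_FAILURE(status) ? -1 : n;
}

The ucnv_fromUChars() call in your code would then take the filled UChar array and the returned length instead of the casted sms->message.s.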

n0p