On MSVC, converting UTF-16 to UTF-32 is easy with C++11's codecvt_utf16 locale facet. But GCC (gcc (Debian 4.7.2-5) 4.7.2) does not seem to have implemented this new feature yet. Is there a way to perform this conversion on Linux without iconv (preferably using the conversion tools of the standard library)?
-
Is there a reason why you don't want to use iconv? – Dale Wilson May 28 '14 at 18:48
-
Is there a reason why you don't want to implement it yourself? – peppe May 28 '14 at 18:50
-
Well, the reason is that if this can be done with std, why reinvent the wheel? – Al Berger May 28 '14 at 18:54
-
Because your std doesn't implement it :-) – peppe May 28 '14 at 18:54
-
Is codecvt_utf16 the only way in std to do such a conversion? – Al Berger May 28 '14 at 18:56
1 Answer
Decoding UTF-16 into UTF-32 is extremely easy.
You may want to detect at compile time which standard library version you're using, and deploy your own conversion routine if you detect a broken one (without the functions you need).
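As a rough sketch of such a compile-time check, one option is to probe for the `<codecvt>` header with `__has_include` and fall back to a hand-written routine otherwise. The macro name `HAVE_STD_CODECVT` and the helper `using_std_codecvt` are made up for this illustration, and the check assumes that a toolchain shipping the header also ships working facets.

```cpp
// Hypothetical feature macro: 1 if <codecvt> can be included, 0 otherwise.
// Compilers that predate __has_include fall through to the fallback path.
#if defined(__has_include)
#  if __has_include(<codecvt>)
#    define HAVE_STD_CODECVT 1
#  endif
#endif
#ifndef HAVE_STD_CODECVT
#  define HAVE_STD_CODECVT 0   // old compiler or missing header
#endif

#if HAVE_STD_CODECVT
#  include <codecvt>           // std::codecvt_utf16 should be usable here
#endif

// Reports which path was selected at compile time (illustrative helper).
constexpr bool using_std_codecvt() { return HAVE_STD_CODECVT != 0; }
```

Code that needs the conversion can then branch on `HAVE_STD_CODECVT` with `#if` and call either the library facet or its own routine.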
Inputs:
- a pointer to the source UTF-16 data (char16_t *, ushort *, or, for convenience, UTF16 *);
- its size;
- a pointer to the UTF-32 data (char32_t *, uint *, or, for convenience, UTF32 *).
The code looks like this:
void convert_utf16_to_utf32(const UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
    const UTF16 * const end = input + input_size;
    while (input < end) {
        const UTF16 uc = *input++;
        if (!is_surrogate(uc)) {
            *output++ = uc;
        } else {
            if (is_high_surrogate(uc) && input < end && is_low_surrogate(*input)) {
                *output++ = surrogate_to_utf32(uc, *input++);
            } else {
                // ERROR: unpaired surrogate
            }
        }
    }
}
Error handling is left to the reader. You might want to insert a U+FFFD¹ into the stream and keep going, or just bail out; it's really up to you. The auxiliary functions are trivial:
// Unsigned arithmetic: values below 0xd800 wrap around and fail the test.
int is_surrogate(UTF16 uc) { return (uc - 0xd800u) < 2048u; }
int is_high_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xd800; }
int is_low_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xdc00; }
UTF32 surrogate_to_utf32(UTF16 high, UTF16 low) {
    // 0x35fdc00 == (0xd800 << 10) + 0xdc00 - 0x10000
    return (high << 10) + low - 0x35fdc00;
}
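For illustration, here is one way the pieces could be put together with the first error policy mentioned above (substituting U+FFFD and carrying on). It is only a sketch: the function name `utf16_to_utf32_lossy` is invented here, it uses `char16_t`/`char32_t` directly instead of the `UTF16`/`UTF32` typedefs, it inlines the surrogate tests so the block stands alone, and it assumes the caller provides an output buffer of at least `input_size` elements.

```cpp
#include <cassert>
#include <cstddef>

// Decodes UTF-16 into UTF-32, writing U+FFFD for every unpaired surrogate.
// Returns the number of code points written. `output` must have room for
// `input_size` elements (the worst case: no surrogate pairs at all).
std::size_t utf16_to_utf32_lossy(const char16_t *input, std::size_t input_size,
                                 char32_t *output)
{
    assert(input != nullptr || input_size == 0);
    const char16_t *const end = input + input_size;
    char32_t *out = output;
    while (input < end) {
        const char32_t uc = *input++;
        if (uc - 0xd800u >= 2048u) {
            *out++ = uc;                          // not a surrogate
        } else if (uc < 0xdc00u && input < end &&
                   (*input & 0xfc00u) == 0xdc00u) {
            // high surrogate followed by a low surrogate
            *out++ = (uc << 10) + *input++ - 0x35fdc00u;
        } else {
            *out++ = 0xfffdu;                     // unpaired surrogate
        }
    }
    return static_cast<std::size_t>(out - output);
}
```

Returning the count lets the caller know how much of the output buffer is valid, since each surrogate pair shrinks two input units into one output unit.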
¹ Cf. Unicode:
- § 3.9 Unicode Encoding Forms (Best Practices for Using U+FFFD)
- § 5.22 Best Practice for U+FFFD Substitution
² Also consider that the !is_surrogate(uc) branch is by far the most common (as is the non-error path in the second if); you might want to optimize for that with __builtin_expect or similar.
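A sketch of what those branch hints might look like, combined with the "bail out" error policy. The `LIKELY`/`UNLIKELY` macros and the function name `utf16_to_utf32_strict` are assumptions made for this example; `__builtin_expect` is a GCC/Clang extension, so the macros degrade to plain expressions elsewhere. Whether the hints help at all is something to measure, not assume.

```cpp
#include <cassert>
#include <cstddef>

// __builtin_expect is a GCC/Clang extension; fall back to a no-op elsewhere.
#if defined(__GNUC__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

// Strict variant: stops at the first unpaired surrogate and returns false.
// On success, *written holds the number of code points stored in output.
bool utf16_to_utf32_strict(const char16_t *input, std::size_t input_size,
                           char32_t *output, std::size_t *written)
{
    assert(written != nullptr);
    const char16_t *const end = input + input_size;
    char32_t *out = output;
    while (input < end) {
        const char32_t uc = *input++;
        if (LIKELY(uc - 0xd800u >= 2048u)) {      // hot path: not a surrogate
            *out++ = uc;
            continue;
        }
        const bool paired = uc < 0xdc00u && input < end &&
                            (*input & 0xfc00u) == 0xdc00u;
        if (UNLIKELY(!paired))
            return false;                         // cold path: malformed input
        *out++ = (uc << 10) + *input++ - 0x35fdc00u;
    }
    *written = static_cast<std::size_t>(out - output);
    return true;
}
```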
-
Almost the same amount of code that is required for using codecvt_utf16. – Al Berger May 28 '14 at 19:25
-
Thanks! But what about the 0000..D7FF range? It seems like the is_surrogate function does not count it. – user2134488 Nov 24 '18 at 12:29
-
Uhm, re-reading the code *FIVE YEARS LATER*, I think I have made a typo in `is_surrogate`. I was thinking about using unsigned arithmetic, not signed. – peppe Jan 29 '19 at 17:49
-
Mr. Peppe, yes, it works now! Thank you very much! Is it possible to use your code in proprietary projects? – user2134488 Feb 13 '19 at 11:30
-
Yes, of course. It's so simple I can hardly claim copyright on that. Pretty sure it's also found everywhere... – peppe Feb 19 '19 at 06:01
-
Thanks for fixing the typo! I will be putting this in production code, although adding error checking. – vy32 Dec 05 '20 at 16:41