I'm programming in C and want any UTF-8 locale (e.g., "ru_RU.UTF-8", "en_US.UTF-8", etc.) to convert to the wchar_t version (using the mbrtowc function). It doesn't particularly matter which wchar_t value it converts to, as long as it's a valid wchar_t in some locale.

Is there a "UTF-8 whatever" setting I can pass to setlocale?

Like I'm looking for the exact opposite of setlocale(LC_ALL, "POSIX") / setlocale(LC_ALL, "C").

To clarify, the C code...

setlocale(LC_ALL, "ru_RU.UTF-8");
stuff = mbrtowc(..... )

works, where the C code...

setlocale(LC_ALL, "en_US.UTF-8");
stuff = mbrtowc(..... )

returns (size_t)-1 as soon as it hits Cyrillic. The text I'm dealing with might also contain Japanese characters, etc.

Kevin J. Chase
el capitan

1 Answer


The problem with locales and wchar functions in C is that they're highly platform-dependent. For what it's worth, I have no problem converting Cyrillic UTF-8 to wchars with the en_US.UTF-8 locale on Linux (Ubuntu 16.04). The following code

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    const char in[] = "\xD0\xB1";  /* UTF-8 for CYRILLIC SMALL LETTER BE */
    wchar_t out;
    size_t consumed;

    setlocale(LC_ALL, "en_US.UTF-8");
    consumed = mbrtowc(&out, in, sizeof(in) - 1, NULL);
    /* mbrtowc reports errors as (size_t)-1 (invalid sequence) or
     * (size_t)-2 (incomplete sequence); size_t is unsigned, so a
     * plain `> 0` check would not catch them. */
    if (consumed != (size_t)-1 && consumed != (size_t)-2) {
        printf("%04x\n", (unsigned)out);
    }

    return 0;
}

prints

0431

as expected. On other platforms, your mileage may vary. Platforms with a 16-bit wchar_t like Windows are especially problematic. But a sane platform should be able to encode and decode all Unicode characters with any UTF-8 locale, so there's no need for a generic UTF-8 locale.

If you simply want to work with UTF-8, you should consider a library for UTF-8 conversion like iconv, utf8proc, libunistring, or ICU. You can also write your own conversion routines. It isn't too hard.
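As a sketch of the "write your own" option, a minimal decoder for a single code point might look like this (a hypothetical helper, not part of any of those libraries; it rejects overlong forms, surrogates, and out-of-range values, but real code would also want a streaming interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 code point from s (at most len bytes), stores it
 * in *out, and returns the number of bytes consumed, or 0 on invalid
 * or truncated input. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out) {
    uint32_t cp;
    size_t need, i;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
    else return 0;                  /* continuation or invalid lead byte */

    if (len < need + 1) return 0;   /* truncated sequence */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and values past U+10FFFF. */
    {
        static const uint32_t min[] = { 0x0, 0x80, 0x800, 0x10000 };
        if (cp < min[need] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }

    *out = cp;
    return need + 1;
}
```

For example, feeding it the same `"\xD0\xB1"` bytes as above yields code point U+0431 and consumes two bytes.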

nwellnhof