I'm programming in C and want any UTF-8 locale (e.g., "ru_RU.UTF-8", "en_US.UTF-8", etc.) to convert to the wchar_t version (using the mbrtowc function). It doesn't particularly matter which wchar_t value it converts to, as long as it's a valid wchar_t in some locale.

Is there a "UTF-8 whatever" setting I can pass to setlocale?

Like I'm looking for the exact opposite of setlocale(LC_ALL, "POSIX") / setlocale(LC_ALL, "C").

To clarify, the C code...

setlocale(LC_ALL, "ru_RU.UTF-8");
stuff = mbrtowc(..... )

works, where the C code...

setlocale(LC_ALL, "en_US.UTF-8");
stuff = mbrtowc(..... )

returns (size_t)-1 as soon as it hits Cyrillic. The text I'm dealing with might also contain Japanese characters, etc.

Kevin J. Chase
el capitan

1 Answer


The problem with locales and wchar functions in C is that they're highly platform-dependent. For what it's worth, I have no problem converting Cyrillic UTF-8 to wchars with the en_US.UTF-8 locale on Linux (Ubuntu 16.04). The following code

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    const char in[] = "\xD0\xB1";  /* UTF-8 for CYRILLIC SMALL LETTER BE */
    wchar_t out;
    size_t consumed;

    setlocale(LC_ALL, "en_US.UTF-8");
    consumed = mbrtowc(&out, in, sizeof(in) - 1, NULL);
    /* mbrtowc reports errors as (size_t)-1 (invalid sequence) or
     * (size_t)-2 (incomplete sequence); size_t is unsigned, so a
     * plain `> 0` check would not catch them. */
    if (consumed != (size_t)-1 && consumed != (size_t)-2) {
        printf("%04x\n", (unsigned)out);
    }

    return 0;
}

prints

0431

as expected. On other platforms, your mileage may vary. Platforms with a 16-bit wchar_t like Windows are especially problematic. But a sane platform should be able to encode and decode all Unicode characters with any UTF-8 locale, so there's no need for a generic UTF-8 locale.

If you simply want to work with UTF-8, you should consider a library for UTF-8 conversion like iconv, utf8proc, libunistring, or ICU. You can also write your own conversion routines. It isn't too hard.
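As a sketch of the "write your own" option, a minimal decoder for a single code point might look like this (a hypothetical helper, not part of any of those libraries; it rejects overlong forms, surrogates, and out-of-range values, but real code would also want a streaming interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 code point from s (at most len bytes), stores it
 * in *out, and returns the number of bytes consumed, or 0 on invalid
 * or truncated input. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out) {
    uint32_t cp;
    size_t need, i;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
    else return 0;                  /* continuation or invalid lead byte */

    if (len < need + 1) return 0;   /* truncated sequence */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and values past U+10FFFF. */
    {
        static const uint32_t min[] = { 0x0, 0x80, 0x800, 0x10000 };
        if (cp < min[need] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }

    *out = cp;
    return need + 1;
}
```

For example, feeding it the same `"\xD0\xB1"` bytes as above yields code point U+0431 and consumes two bytes.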

nwellnhof