9

Why can't UTF-8 characters be printed with GLib functions?

Source code:

#include "glib.h"
#include <stdio.h>

int main() {
    g_print("марко\n");
    fprintf(stdout, "марко\n");
}

Build it like this:

gcc main.c -o main $(pkg-config glib-2.0 --cflags --libs)

You can see that GLib fails to print the UTF-8 string while fprintf succeeds:

[marko@marko-work utf8test]$ ./main 
?????
марко
Marko Kevac

4 Answers

10

The fprintf functions assume that every string you print with them is already encoded to match the current encoding of your terminal. g_print() does not assume that and will convert the string to the character set of the current locale if it thinks that is necessary. Of course this is a bad idea if the encoding was actually correct before, since the conversion will most likely destroy it. What is the locale setting of your terminal?

You can either set the correct locale through environment variables on most systems, or you can do it programmatically using the setlocale function. The locale names are system dependent (not part of the POSIX standard), but on most systems the following will work:

#include <locale.h>

:

setlocale(LC_ALL, "en_US.utf8");

Instead of LC_ALL you can also set the locale for certain operations only (e.g. "en_US" will cause English number and date formatting, but maybe you don't want numbers/dates to be formatted that way). To quote from the setlocale man page:

LC_ALL Set the entire locale generically.

LC_COLLATE Set a locale for string collation routines. This controls alphabetic ordering in strcoll() and strxfrm().

LC_CTYPE Set a locale for the ctype(3) and multibyte(3) functions. This controls recognition of upper and lower case, alphabetic or non-alphabetic characters, and so on.

LC_MESSAGES Set a locale for message catalogs, see catopen(3) function.

LC_MONETARY Set a locale for formatting monetary values; this affects the localeconv() function.

LC_NUMERIC Set a locale for formatting numbers. This controls the formatting of decimal points in input and output of floating point numbers in functions such as printf() and scanf(), as well as values returned by localeconv().

LC_TIME Set a locale for formatting dates and times using the strftime() function.

The only three locale values that are always available on all systems are "C", "POSIX" and "".

Only three locales are defined by default: the empty string "" (which denotes the native environment) and the "C" and "POSIX" locales (which denote the C-language environment). A locale argument of NULL causes setlocale() to return the current locale. By default, C programs start in the "C" locale. The only function in the library that sets the locale is setlocale(); the locale is never changed as a side effect of some other routine.

Mecki
  • After setlocale(LC_ALL, "en_US.UTF-8") everything works, but without it and with LANG=en_US.UTF-8 ./main it does not work. Why is this? System default is en_US.UTF-8. – Marko Kevac Jun 22 '10 at 11:29
  • Don't you have to export the variable to be visible to the sub-process? Also the variables are named as shown on the man page, try `export LC_ALL=en_US.utf8 && ./main`; maybe it is also enough to set LC_CTYPE for string printing only. – Mecki Jun 22 '10 at 12:08
  • You need export if you want to 'save' the variable. If you want it just for one application, then it's enough to put it before the program name. Anyway, I have done export for LANG, LC_ALL and LC_CTYPE. Nothing. It still doesn't work. Strange... – Marko Kevac Jun 22 '10 at 13:25
  • Use `setlocale(LC_CTYPE, "")`! The important thing is to always use the `""` string for setlocale, *not* a hardcoded locale. – u0b34a0f6ae Jun 23 '10 at 00:34
  • If you set it to "" then it won't be necessarily UTF8 and it might again not print correctly, because "" means no locale and no locale means that anything other than ASCII is not defined for strings. – Mecki Jun 23 '10 at 10:42
  • Mecki, that is not correct. From the `setlocale` man page: 'If locale is "", each part of the locale that should be modified is set according to the environment variables.' – skagedal May 19 '13 at 21:01
  • @skagedal That's maybe how *your system* is doing it, but that is not POSIX required behavior. See http://pubs.opengroup.org/onlinepubs/7908799/xsh/setlocale.html Unless a system wants to be XSI-conformant, `""` means set locale to a system specific native value which can be pretty much anything (yet, no matter what it is, ASCII is always guaranteed to work). – Mecki May 22 '13 at 14:41
  • @Mecki - I stand corrected! Sorry and thanks for the clarification. (For the record, the man page referenced in my previous comment was the Linux one.) – skagedal May 23 '13 at 20:21
2

You need to initialize the locale's encoding by calling setlocale at your program's start.

setlocale(LC_CTYPE, "");

This is normally carried out for you if you use some initialization function like gtk_init(..) or similar.

u0b34a0f6ae
1

The string passed from g_print() to glibc is not necessarily in UTF-8 encoding since g_print() does character set conversion to the charset specified by the locale.

Luca Matteis
  • I would not be so sure. I can't find confirmation of your beliefs for modern Glib versions. This does not work for Glib 2.56.4 and earlier, for example. – Dennis V Mar 15 '20 at 12:37
0

Usually it is not recommended to use anything other than ASCII inside source files. You should use tools like gettext to translate strings into different languages. If this is out of the question, then you should store your string in UTF-8 in your code.

Try printing this one (it's the hexadecimal representation of your string):

char hex_marco[]={0xD0, 0xBC, 0xD0, 0xB0, 0xD1, 0x80, 0xD0, 0xBA, 0xD0, 0xBE, 0};

This works for me in printf (cannot test here with glib):

#include <stdio.h>

char hex_marco[]={0xD0, 0xBC, 0xD0, 0xB0, 0xD1, 0x80, 0xD0, 0xBA, 0xD0, 0xBE, 0};

int main(void)
{
    printf("%s\n",hex_marco);
    return 0;
}

Redirect the output to a file and view it as UTF-8.

Hope it helps.

INS
  • "marko" in *.c file was just for example. I am not using UTF-8 inside source code. Right answer was already given. Thank you anyway! – Marko Kevac Jun 22 '10 at 11:33