gettext character encoding

Question

I have the following gettext .po file, which has been translated from a .pot file. I am working on a Linux system (openSUSE if it matters), running gettext 0.17.

# 
#   <translate@transme.de>, 2011
# transer <translate@transme.de>, 2011
msgid ""
msgstr ""
"Project-Id-Version: transtest\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2011-05-24 22:47+0100\n"
"PO-Revision-Date: 2011-05-30 23:03+0100\n"
"Last-Translator: \n"
"Language-Team: German (Germany)\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Language: de_DE\n"
"Plural-Forms: nplurals=2; plural=(n != 1)\n"

#: transtest.cpp:12
msgid "Min Size"
msgstr "Min Größe"

I create the .mo file with

msgfmt -c transtest_de_DE.po -o transtest.mo

The "file" command reports the .po file as UTF-8:

file --mime transtest_de_DE.po
transtest_de_DE.po: text/x-po; charset=utf-8

When I then install the .mo file to my locale folder and run the program after exporting LANG and LC_CTYPE, I end up with garbage where the two non-ASCII characters should be.

If I set my terminal encoding to ISO-8859-2, rather than UTF-8, then I see the two characters correctly.

Looking inside the generated .mo file with a text editor the file appears to be in UTF-8 as well (I can see the symbols if I set my editor encoding to UTF-8).

The program is very simple, and it looks like so:

#include <clocale>    // setlocale
#include <iostream>
#include <libintl.h>  // gettext, bindtextdomain, textdomain
const char *PROGRAM_NAME="transtest";

using namespace std;

int main()
{
    setlocale (LC_ALL, "");
    bindtextdomain( PROGRAM_NAME, "/usr/share/locale" );
    textdomain( PROGRAM_NAME );
    cerr << gettext("Min Size") << endl;
}

I am installing the .mo file to /usr/share/locale/de_DE/LC_MESSAGES/transtest.mo, and I have exported LC_CTYPE and LANG as "de_DE".

$ echo $LC_CTYPE; echo $LANG
de_DE
de_DE

Where am I going wrong? Why is gettext giving me my strings in the wrong encoding (apparently ISO-8859-2) rather than the UTF-8 requested in the .po file?

Edit:

The solution was in the Stack Overflow question "Can't make (UTF-8) traditional Chinese character to work in PHP gettext extension (.po and .mo files created in poEdit)". It turns out that I needed to explicitly call

bind_textdomain_codeset(PROGRAM_NAME, "utf-8");

The final program looks like so:

#include <clocale>    // setlocale
#include <iostream>
#include <libintl.h>  // gettext, bindtextdomain, bind_textdomain_codeset, textdomain
const char *PROGRAM_NAME="transtest";

using namespace std;

int main()
{
    setlocale (LC_ALL, "");
    bindtextdomain( PROGRAM_NAME, "/usr/share/locale" );
    bind_textdomain_codeset(PROGRAM_NAME, "utf-8");
    textdomain( PROGRAM_NAME );
    cerr << gettext("Min Size") << endl;
}

No changes to any of my gettext files were needed.

I'm mighty shaky on locales, but if you wanted UTF-8 strings, shouldn't you set your `LANG=de_DE.utf8`? — sarnold, May 30 '11 at 22:44
I just tried that, but it does not seem to make any difference, even if I alter the .mo install location. Anyway, I have specified it in the .po file, which I would have thought gave gettext all the info it needs. — CodeT, May 30 '11 at 22:50
Oh man. five hours later I find this post: http://stackoverflow.com/questions/2264740/cant-make-utf-8-traditional-chinese-character-to-work-in-php-gettext-extension. Oh well, problem solved... sorry for the noise! — CodeT, May 30 '11 at 22:57
@CodeT, if that post actually provides the information you need to solve the problem, _please_ summarize its contents in an answer here and accept it. :) I don't immediately see how that answer would be useful to you, so hopefully your answer can help others in the future. — sarnold, May 30 '11 at 22:59
how do I accept an answer in the other URL?? The answer was that I needed to set the output text format explicitly using the bind_textdomain_codeset function, or gettext defaulted to whatever it felt like, which is not UTF-8 -- despite what I say in my .po file — CodeT, May 30 '11 at 23:02
@CodeT, just re-write that last comment into an answer on _this_ question, maybe copy in the code that _works_, and then click the accept mark. :) A link to the other question/answer might be kind, but not really necessary since it's already in the comments. — sarnold, May 30 '11 at 23:08
This site won't let me answer my question as i don't have enough "rep" or somesuch. I just put it in as an edit. — CodeT, May 30 '11 at 23:15
@CodeT, my last comment was quite wrong: http://meta.stackexchange.com/questions/86185/minimum-reputation-for-answering-your-own-question-should-be-higher-than-what-is/86186#86186 Looks like you need to wait an eternity (aka 8 hours). Sorry. — sarnold, May 31 '11 at 00:37
@CodeT: You should be able to move your solution down to an answer now so we can get this off the unanswered list. Thank you. — Bill the Lizard, May 31 '11 at 13:10

score 5 · Answer 1 · answered May 31 '11 at 13:41

If you have LC_CTYPE=de_DE (or LANG), programs are supposed to output ISO-8859-1 (note, 1, not 2), so if you have that and your terminal is set to utf-8, it's simply wrong. The correct locale for utf-8 is de_DE.utf-8.

Using bind_textdomain_codeset is wrong in your case. bind_textdomain_codeset is used if you want to work in fixed encoding internally, like e.g. GNOME does, but output should always be in what the locale specifies (obtained by calling nl_langinfo(CODESET), which is also what gettext does by default).
