22

I have to write software that must work on several *nix platforms (Linux, AIX, ...).

I need to handle internationalization and my translation strings are in the following form:

"Hi %1, you are %2." // English
"Vous êtes %2, bonjour %1 !" // French

Here %1 stands for the name and %2 for another word. I may change the format; that's not an issue.

I tried to use printf(), but you cannot specify the order of the parameters; you only specify their types.

"Hi %s, you are %s"
"Vous êtes %s, bonjour %s !"

Now there is no way to know which parameter to use for replacement of %s: printf() just uses the first one, then the next.

Is there any alternative to printf() that deals with this?

Note: gettext() is not an option.

ereOn

4 Answers

26

I don't mean to be the bearer of bad tidings but what you're proposing is actually a bad idea. I work for a company that takes i18n very seriously and we've discovered (painfully) that you cannot just slot words into sentences like that, since the results often make no sense.

What we do is simply disconnect the error text from the variable bits altogether, so as to avoid these problems. For example, we'll generate an error:

XYZ-E-1002 Frobozz not configured for multiple zorkmids (F22, 7).

And then, in the description of the error, you state simply that the two values in the parentheses at the end were the Frobozz identifier and the number of zorkmids you tried to inflict on it.

This leaves i18n translation as an incredibly easy task since you have, at translation time, all of the language elements you need without worrying whether the variable bits should be singular or plural, masculine or feminine, first, second, or third declension (whatever the heck that actually means).

The translation team simply has to convert "Frobozz not configured for multiple zorkmids" and that's a lot easier.


For those who would like to see a concrete example, I have something back from our translation bods (with enough stuff changed to protect the guilty).

At some point, someone submitted the following:

The {name} {object} is invalid

where {name} was the name of an object (customers, orders, etc) and {object} was the object type itself (table, file, document, stored procedure, etc).

Simple enough for English, the primary (probably only) language of the developers, but they struck a problem when translating to German/Swiss-German.

While the "customers document" translated correctly (in a positional sense) to Kundendokument, the fact that the format string had a space between the two words was an issue. That was basically because the developers were trying to get the sentence to sound more natural but, unfortunately, only more natural based on their limited experience.

A bigger problem was with the "customers stored procedure" which became gespeichertes Verfahren der Kunden, literally "stored procedure of the customers". While the German customers may have put up with a space in Kunden dokument, there is no way to impose gespeichertes Verfahren der Kunden onto {name} {object} successfully.

Now you may say that a cleverer format string would have fixed this but there are several reasons why that would be incorrect:

  • this is a very simple example; there are likely to be others more complex (I'd try to get some examples but our translation bods have made it clear they have more pressing work than to submit themselves to my every whim).
  • the whole point of the format strings is to externalise translation. If the format strings themselves are specific to the translation target, you've gained very little by externalising the text.
  • developers should not have to concern themselves with format strings like {possible-pre-adjectives} {possible-pre-owner} {object} {possible-post-adjectives} {possible-post-owner} {possible-postowner-adjectives}. That is the job of the translation teams since they understand the nuances.

Note that introducing the disconnect sidesteps this issue nicely:

The object specified by <parameter 1>, of type <parameter 2>, is invalid.
    Parameter 1 = {name}.
    Parameter 2 = {object}.
Der sache nannte <parameter 1>, dessen art <parameter 2> ist, ist falsch. 
    Parameter 1 = {name}.
    Parameter 2 = {object}.

That last translation was one of mine, please don't use it to impugn the quality of our translators. No doubt more fluent German speakers will get a good laugh out of it.

paxdiablo
  • +1 for the good advice. I cannot decide here. I will, however, try to explain your point to them. – ereOn Oct 12 '10 at 09:14
  • @paxdiablo: I would be interested in a real-world example since I can’t think of a case where this wouldn’t work (given meaningful words/identifiers to slot into the sentence). I’ve actually written software this way so if you can point out caveats of the method, please do. – Konrad Rudolph Oct 12 '10 at 09:22
  • @Konrad Rudolph: Any situation where a grammatical object splits in two in one language but not in another. Consider for example `printf("I %s know", iKnow ? "" : "don't");` The French equivalents to "I know" and "I don't know" are "Je sais" and "Je ne sais pas". The negative won't fit the template that works fine for English. – JeremyP Oct 12 '10 at 10:03
  • @JeremyP: But that's more a result of sloppiness and why people are discouraged from injecting words that way instead of using complete sentences/phrases. – Ignacio Vazquez-Abrams Oct 12 '10 at 10:51
  • @pax: And how do non-technical users respond to messages like, "I'm sorry, we're out of stock for one of the items you ordered. The item, and the time it should be available by, are at the end of this message. (Widget23, Friday)"? ;-) Not saying your idea is a bad one, just that an error message directed at people who actually read manuals isn't a particularly difficult case, as i18n goes. – Steve Jessop Oct 12 '10 at 11:01
  • @Jeremy: Like Ignacio I don’t consider this a valid case. The word(s) “don’t” can’t be injected into a localized text at all, since it’s not localized itself. As far as I’m concerned, only single real words in the lexical sense may ever be inserted, which means that such a problem won’t arise. – Konrad Rudolph Oct 12 '10 at 11:04
  • @Konrad: A more common issue is that a naive English-speaker would create the template `"%d %s %s hanging on the wall"`, with expected values including green/blue/pink and bottle(s)/elephant(s). But you discover that in order to localise the second argument, you need to know the gender of the third argument. Both are "single real words", but they're not sufficiently independent to be inserted separately, and even in English the word to insert depends on the value for `%d`. Sometimes you can only insert one "thing" into a sentence, and translation needs to be smarter than `printf`. – Steve Jessop Oct 12 '10 at 11:15
  • @pax: declension means the way that nouns alter according to the case they appear in (which often depends on some preposition). There are few examples left in English, but pronouns still decline: there's I/me, she/her, and if you're being formal, "who" in the accusative case becomes "whom" - "Whom should I contact?". Latin has 6 cases and 5 "declensions". That is, there are 5 different groups a noun can belong to, and once you know the noun's stem, and its declension, and its gender, that tells you how the noun transforms in each of the 6 cases. `printf` doesn't handle this well ;-) – Steve Jessop Oct 12 '10 at 11:28
  • Steve, on top of gender issues, there's the ordering problem. The English "the red table" is the French "la table rouge". An English-speaking developer shouldn't need to know all the nuances of the twenty-odd locales we have to translate for. That's why it's done the way it is in our shop (thanks for the education in your last comment by the way, I've always wondered about that but we didn't get to do much Latin in a country school). @Konrad, I'll ask our Japanese/German translation bods for a specific example tomorrow when I'm back at work – paxdiablo Oct 12 '10 at 11:29
  • @pax: sure, but order is the one problem that `printf` does solve by itself, if you use a version with positional arguments. Everything else you mentioned (plural, gender, declension) requires actual logic in order to calculate the correct sentence from the variable arguments provided. As soon as the logic varies by language, printf-based localisation only gets you so far. – Steve Jessop Oct 12 '10 at 11:34
  • @Konrad Rudolph: It's just an example to show you the possible pitfalls. Also, in reality, you would localise whatever you are injecting. Also if you are assuming that whatever you can substitute as a single word in your native language can be substituted as a single word in any other language, you are wrong. – JeremyP Oct 12 '10 at 14:00
  • @Jeremy: Notice that I explicitly defined “words” as in the lexical sense. By this I mean atomic units regardless of language. So for example, a *file path* will always be the same unit, so having a localized string “`File %s was not found.`” should always be safe. Likewise for entities in the business context of the program (e.g. object names): “`Are you sure that you want to move %s to the trash?`”. Perhaps I should have been more explicit (“lexical” was a lousy term to use) but my question basically is: can *these* kinds of texts really make problems? If so, I’d really like to see examples. – Konrad Rudolph Oct 12 '10 at 14:10
  • @Konrad Rudolph: Actually you said "given meaningful words/identifiers". You didn't restrict the problem domain to simple things like just inserting a file path. Of course, if you do, then this kind of substitution does work. – JeremyP Oct 12 '10 at 14:25
  • This conversation reminds me of one time when I saw a video of one of those American stage televangelist types doing a performance in Russia, translated live over the speakers - at some point he decided to say every word individually for emphasis, and they got amusingly stuck for a good minute on "to". (In Russian, the meaning that "to" adds in English is instead usually "fused" into the words around it, by changing those words.) This is just one of countless things like this - every assumption you make so automatically that you don't even notice you are making it, some language violates. – mtraceur Aug 02 '19 at 15:15
  • Being fluent in two languages from a young age I've had the vantage point to see these little incompatibilities *constantly*. This answer is the generically right way - it will never lead you astray. You shouldn't need to mix constant strings with variable data in most cases anyway - the temptation to make it as natural-language-like as possible is illusory. Good UX design is to make the relevant information *easy to find and parse* (visually, or programmatically for accessibility aids), and most users will *avoid* fully reading your proper natural language sentences most of the time anyway. – mtraceur Aug 02 '19 at 15:24
25

POSIX printf() supports positional arguments.

printf("Hi %1$s, you are %2$s.", name, status);
printf("Vous êtes %2$s, bonjour %1$s !", name, status);
Ignacio Vazquez-Abrams
  • Cool! I had no idea! Any reference? – Prof. Falken Oct 12 '10 at 09:09
  • Oh. But I must say I don't understand the manpage completely. How can printf("%*d", width, num); and printf("%2$*1$d", width, num); be equivalent if arguments are numbered left to right? – Prof. Falken Oct 12 '10 at 09:13
  • @Amigable, it does make sense if you think about it. This is _one_ thing being printed. `%2$*1$d` breaks down into `(%2$(*1$)d)` where the inner parenthesised bit specifies that param1 is to be used as the width for the parameter given by param2. It's equivalent to `%*d` breaking down into `(%(*)d)` with sequential assigning of param1 and param2. – paxdiablo Oct 13 '10 at 03:10
11

Boost.Format supports this, much as Python does; however, it is a C++ library.

Vinzenz
  • boost::format is the best way because it is also typesafe. The implementation is not optimized, however, and calls through to printf under the covers (or did last time I looked) so is slower than using C printf directly by a factor of 2-3. – dajames Oct 13 '10 at 14:46
  • According to https://www.boost.org/doc/libs/1_70_0/libs/format/doc/format.html, as of 2019 boost::format is still notably slower than printf. – Tomas Tintera Aug 16 '19 at 07:55
9

You want the %n$s extension that is common to most Unix systems.

"Hi %1$s, you are %2$s."

See the German example at the bottom of the printf manual page.

Regards DaveF

David Allan Finch