
I discovered an interesting problem when processing UTF-8 strings containing non-ASCII chars with C standard library formatting functions like sprintf():

The functions of the printf() family are not aware of UTF-8 and process everything based on the number of bytes, not chars. Therefore field widths and precisions come out wrong as soon as a string contains multi-byte characters.

Simple example:

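#include <stdio.h>
#include <string.h>   /* for strlen() */

int main(int argc, char *argv[])
{
    const char* testMsg = "Tääääßt";
    char buf[1024];
    int len;

    /* Width and precision of %7.7s are both counted in bytes, not characters. */
    sprintf(buf, "|%7.7s|", testMsg);
    len = (int)strlen(buf);
    printf("Result=\"%s\", len=%d\n", buf, len);

    return 0;
}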

The result is:

 Result="|Täää|", len=7
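The truncation follows from the encoding: in UTF-8 each ä and the ß take two bytes, so a precision of 7 bytes covers only the T plus three ä. A minimal sketch of how to count code points instead of bytes, assuming well-formed UTF-8 input (the helper name utf8_codepoints is hypothetical and not part of any standard library):

```c
#include <stddef.h>

/* Count code points in a UTF-8 string by counting every byte that is
 * not a continuation byte (continuation bytes look like 10xxxxxx).
 * Assumes the input is well-formed UTF-8; no validation is done. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the precomposed form of the test string, strlen() reports 12 bytes while this helper reports 7 code points.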

Most probably some of you will recommend converting the application from char to wchar_t and using fwprintf(), etc., but that's absolutely impossible because of the huge existing applications. I could imagine writing a wrapper that uses these functions internally, but this would be tricky and very inefficient.

So the best solution would be a UTF-8-aware replacement for the formatting functions of the Standard C Library.

Currently I'm working on QNX 6.4, but replies for other operating systems, e.g. Linux, are also very welcome.

mh.
  • Your example output omits the leading '|' character, which seems unlikely to reflect what really happened. – unwind Feb 17 '12 at 09:12
  • Could you use a Unicode library (like http://www.flexiguided.de/publications.utf8proc.en.html) and feed `printf` the number of bytes for a Unicode string? – trojanfoe Feb 17 '12 at 09:15
  • Just a warning, counting "characters" in Unicode data is quite a complicated business. Besides the fact that each code point in UTF-8 is composed of several bytes, each glyph (or "grapheme") can be composed of several code points, and for that reason `fwprintf` is inadequate for truncating Unicode data anyway -- for example you could cut off an accent without cutting off the character it applies to. So whatever you end up using, make sure that the meaning of the length you specify is clear to you. – Steve Jessop Feb 17 '12 at 09:20
  • possible duplicate of [What is the best unicode library for C?](http://stackoverflow.com/questions/114611/what-is-the-best-unicode-library-for-c) –  Feb 17 '12 at 09:26
  • Functions like `len()` unambiguously return the number of bytes (or, well, elements). The fact that they display as a different number of characters in your locale is basically outside the control of C. If you want display width, don't use a function for counting bytes. – tripleee Feb 17 '12 at 09:30
  • @Steve Jessop QNX has a function utf8strlen() that counts the number of chars in a UTF-8 string. It will work for me now, although I haven't tested yet whether it works correctly for all special cases ;-) . – mh. Feb 17 '12 at 10:08
  • @mh.: The documentation doesn't quite spell this out (it says "UTF-8 characters"), but I reckon `utf8strlen` measures the number of code points. So as Dietrich notes, if you truncate a string that looks like `"Tä"` to 2 "characters", you would end up with `Ta` if the original string was U+0054 U+0061 U+0308. – Steve Jessop Feb 17 '12 at 10:15
  • @tripleee You should not be dealing with bytes or with locales if you are working with Unicode. You should be dealing with abstract code points in the Universal Character Set, and there should be no locale effects having to do with printing. – tchrist Feb 17 '12 at 20:02
  • @SteveJessop It really shouldn’t be hard. The right library makes these things trivial. You should be able to count and step through by either code point or by grapheme without any fuss. However, C and C++ are still rather behind the curve on this. The Web is now >80% Unicode, a 600% growth explosion over the last 5 years. Many other languages make this much easier than C or C++ do. – tchrist Feb 17 '12 at 20:06
  • @tchrist: "The right library makes these things trivial". True of many things -- finding a surgeon is easy but I would still describe surgery as "hard" ;-p In this case, writing the library would be hard (or at least would require careful attention to quite a large standard). Integrating one isn't necessarily easy if the existing code has an inappropriate notion of "length" that needs to be unpicked into several different notions (at least: buffer length, number of code points, number of graphemes). That's why it's important that the meaning of "length" is clear in any given context. – Steve Jessop Feb 19 '12 at 17:50
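The suggestion in the comments to feed printf a byte count can be sketched roughly as follows: find the byte length of the first N code points, pass it as the precision via %.*s, and add the padding by hand. This is only an illustration under assumptions: the names utf8_prefix_bytes and utf8_field are hypothetical, the input is taken to be well-formed UTF-8, and "character" here means code point, not grapheme or display column, so the caveats raised above still apply.

```c
#include <stddef.h>
#include <stdio.h>

/* Return the number of bytes occupied by the first max_cp code points
 * of s (or by the whole string if it is shorter). The number of code
 * points actually covered is stored in *cp_out. */
static size_t utf8_prefix_bytes(const char *s, size_t max_cp, size_t *cp_out)
{
    size_t i = 0, cp = 0;
    while (s[i] != '\0') {
        if (((unsigned char)s[i] & 0xC0) != 0x80) { /* start of a code point */
            if (cp == max_cp)
                break;
            ++cp;
        }
        ++i;
    }
    *cp_out = cp;
    return i;
}

/* Format s as a right-aligned field of exactly `width` code points,
 * similar in spirit to "%W.Ws" but counting code points, not bytes.
 * Returns what snprintf returns (bytes needed, excluding the NUL). */
int utf8_field(char *out, size_t outsz, const char *s, int width)
{
    size_t cp, bytes;
    if (width < 0)
        width = 0;
    bytes = utf8_prefix_bytes(s, (size_t)width, &cp);
    /* Pad with spaces on the left, then emit exactly `bytes` bytes of s. */
    return snprintf(out, outsz, "%*s%.*s", width - (int)cp, "", (int)bytes, s);
}
```

With width 7, the seven-code-point test string from the question passes through unchanged; shorter strings are padded with spaces on the left, as printf does by default.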

2 Answers

Well, once you ask printf to do intelligent padding of Unicode characters, you run into major problems. As they say,

w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝?͞

  • How many Unicode characters are in Tääääßt? Well, it could be anywhere from 7 to 11, depending on how it's encoded. Each ä can be written as U+00E4, which is one character, or it could be written as U+0061 U+0308, which is two characters. So your next hope is to count grapheme clusters. (No, normalization won't make the problem go away.)

  • But, how wide is a grapheme cluster? Obviously, a is one column wide. U+200B should be zero columns wide, it's a "zero-width" space. Should each ひらがな be two columns wide? They usually are in terminal emulators. What happens when you format ひらがな as 7 columns, do you get "ひらが ", which adds a space, or do you get "ひらが", which is only 6 columns?

  • If you cut something up which mixes RTL and LTR text, should you reset the text direction afterwards? What are you going to do? (Some terminal emulators, such as Apple's, support a mixture of left-to-right and right-to-left text.)

  • What is your goal by truncating text? Are you trying to show the user a string in limited space, or are you trying to write a format that uses fixed-width fields?

Basically, if you want to cut Unicode text into chunks, you shouldn't be doing it with something as simple as printf (or wprintf, which is quite possibly worse). Use LibICU to iterate over the breaks you want. Writing a UTF-8 aware version of printf is asking for all sorts of trouble that you don't want.
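The "7 to 11" figure from the first bullet can be checked with a small decoder. The sketch below counts code points and, separately, code points outside the combining diacritical marks block U+0300..U+036F. That second count is only a crude stand-in for grapheme clusters, not an implementation of Unicode text segmentation; the function names are hypothetical and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from UTF-8 and advance *s past it.
 * Assumes well-formed input; stops early at an unexpected NUL. */
static uint32_t utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;
    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    ++p;
    while (extra-- > 0 && *p != '\0')
        cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
    *s = p;
    return cp;
}

/* Count all code points, and those that are not combining diacritical
 * marks (U+0300..U+036F), in the string str. */
void utf8_count(const char *str, size_t *codepoints, size_t *non_combining)
{
    const unsigned char *s = (const unsigned char *)str;
    *codepoints = 0;
    *non_combining = 0;
    while (*s != '\0') {
        uint32_t cp = utf8_next(&s);
        ++*codepoints;
        if (cp < 0x0300 || cp > 0x036F)
            ++*non_combining;
    }
}
```

Fed the precomposed spelling (each ä as U+00E4), this reports 7 code points; fed the decomposed spelling (each ä as U+0061 U+0308), it reports 11 code points but still 7 non-combining ones.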

Dietrich Epp
  • I think I understand the problems you mentioned and am aware that some of them cannot be solved satisfactorily in ASCII. However, for now I'd be happy with a straightforward replacement for printf() that works with European and Asian characters and does not need to take exotic features like changes in text direction into account. My goal concerning truncation in the format is fixed-width fields. I know that this will not work well with Asian chars, which can be wider even in "Courier", but for now this will work for me, until I find time to redesign the app's ASCII-based printing. – mh. Feb 17 '12 at 10:05

The following C99 code snippet defines the function u8printf, in which format specifiers such as %10s yield 10 UTF-8 code points, that is, characters rather than bytes. Don't forget to set the locale with setlocale(LC_ALL,"") somewhere before this routine is called. This works because wprintf operates on wchar_t internally, so a %s argument is converted from multibyte to wide characters before width and precision are applied. You can define u8fprintf and u8sprintf in a similar way. If you want to write this without C99 variable-length arrays, then a suitable combination of malloc/free is also possible.

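#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int u8printf(const char *fmt, ...) {
    /* Convert the multibyte format string to wide characters. */
    size_t n = mbstowcs(NULL, fmt, 0);
    if (n == (size_t)-1) return -1;
    wchar_t wfmt[n + 1];
    mbstowcs(wfmt, fmt, n + 1);

    va_list ap;
    va_start(ap, fmt);
    int r = -1;
    /* Retry with a doubled buffer until the formatted output fits. */
    for (size_t m = 128; m <= 32768; m *= 2) {
        wchar_t wbuf[m];
        va_list ap2;
        va_copy(ap2, ap);  /* each vswprintf call needs a fresh va_list */
        r = vswprintf(wbuf, m, wfmt, ap2);
        va_end(ap2);
        if (r >= 0) {
            char buf[m * 4];  /* up to 4 bytes per UTF-8 code point */
            if (wcstombs(buf, wbuf, m * 4) == (size_t)-1) r = -1;
            else fputs(buf, stdout);
            break;
        }
    }
    va_end(ap);  /* reached on every path after va_start */
    return r;
}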
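As an illustration of the remark that u8sprintf can be defined in a similar way, here is a rough sketch of a bounded variant that avoids variable-length arrays by using fixed-size buffers. The name u8snprintf, the buffer sizes, and the choice to return -1 when the result does not fit are all assumptions made for this example, not part of the answer above. As with u8printf, non-ASCII input only works after setlocale() has selected a UTF-8 locale.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical bounded variant: format through vswprintf so that the
 * width and precision of %s count wide characters, then convert the
 * result back to a multibyte string in out.
 * Returns the number of bytes written (excluding the NUL), or -1 on a
 * conversion error or if any of the fixed buffers is too small. */
int u8snprintf(char *out, size_t outsz, const char *fmt, ...)
{
    wchar_t wfmt[256];   /* wide copy of the format string */
    wchar_t wbuf[1024];  /* wide formatted result */
    size_t n = mbstowcs(wfmt, fmt, 256);
    if (n == (size_t)-1 || n >= 256)
        return -1;       /* invalid or overlong format string */

    va_list ap;
    va_start(ap, fmt);
    int r = vswprintf(wbuf, 1024, wfmt, ap);
    va_end(ap);
    if (r < 0)
        return -1;

    /* Ask for the required byte count first, then convert. */
    size_t bytes = wcstombs(NULL, wbuf, 0);
    if (bytes == (size_t)-1 || bytes >= outsz)
        return -1;
    wcstombs(out, wbuf, outsz);
    return (int)bytes;
}
```

A call such as u8snprintf(buf, sizeof buf, "|%7.7s|", testMsg) would then be the drop-in counterpart of the sprintf call in the question, provided the locale has been set beforehand.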
ejolson