I ran into an interesting problem when formatting UTF-8 strings that contain non-ASCII characters with C standard library functions such as sprintf():
The functions of the printf() family are not aware of UTF-8 and count field width and precision in bytes, not characters. As a result, strings containing multi-byte characters are padded and truncated incorrectly.
Simple example:
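#include <stdio.h>
#include <string.h>  /* strlen() */

int main(int argc, char *argv[])
{
    const char* testMsg = "Tääääßt";
    char buf[1024];
    int len;

    /* width and precision of 7 are counted in bytes, not characters */
    sprintf(buf, "|%7.7s|", testMsg);
    len = (int)strlen(buf);
    printf("Result=\"%s\", len=%d\n", buf, len);
    return 0;
}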
The result is:
Result="|Täää|", len=9
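To make the byte/character mismatch visible, here is a minimal sketch that counts both. The helper name utf8_codepoints() is made up for this illustration, and it assumes the input is valid UTF-8. The test string is written with hex escapes so the result does not depend on the encoding of the source file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any library): counts UTF-8 code points
   by skipping continuation bytes, i.e. bytes of the form 10xxxxxx.
   Assumes that s is valid UTF-8. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* "Tääääßt" spelled out as UTF-8 bytes: 'T', 4 x "ä" (C3 A4), "ß" (C3 9F), 't' */
static const char utf8_sample[] =
    "T\xC3\xA4\xC3\xA4\xC3\xA4\xC3\xA4\xC3\x9Ft";
```

For utf8_sample, strlen() reports 12 bytes while utf8_codepoints() reports 7 characters. A precision of 7 therefore cuts the string after "Täää", which is already 1 + 3 * 2 = 7 bytes long.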
Most probably some of you will recommend converting the application from char to wchar_t and using fwprintf() etc., but that is not feasible because the existing code base is huge. I could imagine writing a wrapper that uses these functions internally, but that would be tricky and very inefficient.
So the best solution would be a UTF-8-aware replacement for the formatting functions of the Standard C Library.
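To show the kind of behaviour I have in mind, here is a rough sketch for a single right-justified string field. It is not an existing library function: the names utf8_prefix_bytes() and utf8_format_field() are invented for this example. It assumes valid UTF-8 input, non-negative width and precision, and that snprintf() (C99) is available. It counts code points, which is not the same as display columns for combining or double-width characters.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the number of bytes occupied by the first
   max_cp code points of s (or by all of s if it is shorter) and stores
   the number of code points actually covered in *cp_out. */
static size_t utf8_prefix_bytes(const char *s, size_t max_cp, size_t *cp_out)
{
    size_t bytes = 0, cp = 0;
    while (s[bytes] != '\0') {
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {  /* start of a code point */
            if (cp == max_cp)
                break;                                   /* limit reached */
            ++cp;
        }
        ++bytes;
    }
    if (cp_out != NULL)
        *cp_out = cp;
    return bytes;
}

/* Hypothetical stand-in for "%<width>.<precision>s" where both numbers
   are meant in code points. Assumes width >= 0 and precision >= 0. */
static int utf8_format_field(char *dst, size_t dst_size,
                             const char *s, int width, int precision)
{
    size_t cp = 0;
    size_t bytes = utf8_prefix_bytes(s, (size_t)precision, &cp);

    /* printf pads by bytes, so widen the field by the number of
       continuation bytes in the part of the string that is printed. */
    int byte_width = width + (int)(bytes - cp);

    return snprintf(dst, dst_size, "%*.*s", byte_width, (int)bytes, s);
}
```

With this, utf8_format_field(buf, sizeof buf, testMsg, 7, 7) should copy all seven characters of "Tääääßt" instead of cutting the string after 7 bytes. A real replacement would of course have to parse the whole format string, which is exactly the part I would like to avoid writing myself.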
Currently I'm working on QNX 6.4, but replies for other operating systems, e.g. Linux, are also very welcome.