I'm using ISO 8859-1 (the Latin-1 extended ASCII character set) in my C application. When I strcpy/strcat the portions of the string together, it works fine. But when I use sprintf() with a "%s %s" format, on some runtimes (particularly certain versions of Android), the string is truncated at the first extended ASCII character (specifically é, although I haven't tried others).

I thought %s was just supposed to copy the bytes until '\0' was hit. I suspect that strcpy/strcat works because it does do just that, without any formatting. What could possibly be going on here?

I should note that I'm not viewing the text using printf(), rather my own text rendering engine which handles ISO-8859-1 just fine.

UPDATE: To clarify, I have an NDK app, which is keeping the string in C, and passing it to my OpenGL based text rendering engine. If I pass the full string as a char* literal, it displays fine. If I sprintf() the portions together, it gets truncated at the é character. For example:

char buffer[1024];
strcpy(buffer, "This is ");
strcat(buffer, "the string I want to diésplay.");

That shows up fine. But this:

sprintf(buffer, "%s%s", "This is ", "the string I want to diésplay.");

Prints as:

This is the string I want to di
user1054922
  • are you sure this is sprintf()'s fault? what does strlen() say? – Oleksandr Kravchuk Jan 28 '16 at 15:33
  • I don't know because I do not have the particular android device (well known Samsung model) of the user encountering this error. On my own test device it works fine, and it also works fine on iOS and Win32. So I'm wondering if sprintf() on some Android runtimes treat char* as UTF-8 or something. – user1054922 Jan 28 '16 at 15:36
  • let me clarify: you've got NDK app, which is passing string to Java application? – Oleksandr Kravchuk Jan 28 '16 at 15:40
  • No. I have an NDK app, which is keeping the string in C, and passing it to my OpenGL based text rendering engine. If I pass the full string as a char* literal, it displays fine. If I sprintf() the portions together, it gets truncated at the é character. For example: char buffer[1024]; strcpy(buffer, "This is "); strcat(buffer, "the string I want to diésplay."); That shows up fine. But this: sprintf(buffer, "%s%s", "This is ", "the string I want to diésplay."); Prints as: This is the string I want to di – user1054922 Jan 28 '16 at 15:42
  • It is entirely possible to get UTF-8-encoded bytes into a C string, but that would not explain your user's observation, because the value `0` does not appear in the UTF-8 encoding of any character other than `'\0'`. It's possible that there is an encoding issue, but I cannot say what encoding could yield behavior such as you describe. – John Bollinger Jan 28 '16 at 15:45
  • Are you sure that the text is in ISO 8859-1 (please don't call that “enhanced ASCII,” that term is extremely ambiguous. Even UTF-8 is some sort of enhanced ASCII)? – fuz Jan 28 '16 at 16:18
  • Works fine at http://ideone.com/mVTBAq. – R Sahu Jan 28 '16 at 16:31
  • Yeah, works fine on iOS, Win32, and my Android device. I wondered if someone had come across this issue before with certain versions of Android, etc. – user1054922 Jan 28 '16 at 16:35

1 Answer

The behavior of s[n]printf() is specified differently than the behavior of string-manipulation functions such as strcpy() and strcat(). The printf-family functions are all required to produce the same byte sequences when presented identical formats and print items. The only difference is in where those bytes are sent. Thus, if your C library were built such that it performed a transformation on string data (maybe a transcoding) when printing to the standard streams via printf(), then it would perform that same transformation when printing to a string via sprintf().

The "f" in "printf" is for "formatted". The standard neither says nor implies that formatting a string must mean dumping its bytes to the output verbatim, so a transcoding or other transformation such as I hypothesized above is not out of the question. In fact, the docs for some versions of these functions indicate locale-dependence ("Note that the length of the strings produced is locale-dependent and difficult to predict"), so transcoding in particular is a real possibility.

Any specific explanation of the third-party observations you describe would necessarily be speculative, as you have not presented nearly enough code or data to make a confident diagnosis. I am inclined to suspect an issue revolving around running the program in a locale that uses a character encoding differing from the one used internally by the program. If so, then you may be able to reproduce the problem locally by varying the locale in which you run, and you may be able to address it by ensuring one way or another that your program always runs in a suitable locale. Among other things, you might use the setlocale() function to help here: called with a null locale argument it reports the current locale without changing it, which lets you save and later restore the locale if you want to limit the scope in which you exercise locale control.
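
As a rough sketch of that experiment, the helper below formats two pieces under a named locale and then restores the previous one. The name format_in_locale is hypothetical, and the locale names you pass (other than "C") are assumptions that depend on what the platform has installed. It uses snprintf() rather than sprintf() only to bound the write; both belong to the same printf family.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a and b with "%s%s" while locale_name is in
   effect, then restores the locale that was active before the call.
   Returns snprintf()'s result, or -1 if the locale could not be saved or set. */
static int format_in_locale(const char *locale_name, const char *a,
                            const char *b, char *out, size_t out_size)
{
    char saved[256];
    const char *current = setlocale(LC_ALL, NULL); /* query only, no change */
    int written;

    /* Copy the name: later setlocale() calls may overwrite the returned string. */
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1; /* locale not available on this system */

    written = snprintf(out, out_size, "%s%s", a, b);

    setlocale(LC_ALL, saved); /* limit the scope of the locale change */
    return written;
}
```

If the return value or strlen(out) differs from strlen(a) + strlen(b) under some locale, that would point toward the library transforming or rejecting the bytes in that locale. Write the é as the escape "\xE9" in test strings so the source file's own encoding does not affect the result.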

Since ultimately you are relying on printf-family functions only for string manipulation, however, I think it would be better to use the workaround presented in the question: as much as possible, use C's dedicated string-manipulation functions, such as strcpy() and strncat(), to perform your string building. Since you are not relying on the stdio functions for your actual output, this should be fine.
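
A minimal sketch of that approach, assuming a hypothetical helper named join_bytes: it copies both pieces byte for byte with strlen() and memcpy(), so no formatting or locale handling is involved, and it refuses to write when the result would not fit.

```c
#include <string.h>

/* Hypothetical helper: joins a and b into out without any formatting step.
   Returns 0 on success, or -1 if out is too small for both strings plus
   the terminating '\0' (in which case out is left untouched). */
static int join_bytes(char *out, size_t out_size, const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);

    /* Need la + lb + 1 bytes; written this way to avoid size_t overflow. */
    if (la >= out_size || lb >= out_size - la)
        return -1;

    memcpy(out, a, la);
    memcpy(out + la, b, lb);
    out[la + lb] = '\0';
    return 0;
}
```

With the buffer from the question, the call would look like join_bytes(buffer, sizeof buffer, "This is ", "the string I want to di\xE9splay."), where "\xE9" is the ISO 8859-1 byte for é.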

John Bollinger
  • Fantastic answer! Exactly what I was looking for. I'll try messing with the locales and see if I can duplicate it. – user1054922 Jan 30 '16 at 14:09