I stumbled upon a problem while going through some unit tests, and I am not entirely sure why the following simple example crashes on the line with sprintf (using Visual Studio 2019 on Windows).

#include <stdio.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "en_US.utf8");
    char output[255];
    sprintf(output, "simple %ls text", L"\u00df\U0001d10b");
    return 0;
}

Is there something wrong with the code?

Julius
  • Did you check the return value of `setlocale`? – Andreas Wenzel Jul 18 '20 at 11:43
  • @AndreasWenzel I just did now, it's "en_US.utf8" – Julius Jul 18 '20 at 11:44
  • Is `L"\u00df\U0001d10b"` supposed to be a valid wide character string or a valid UTF-8 string or are you simply attempting to define a certain byte sequence in memory? – Andreas Wenzel Jul 18 '20 at 12:53
  • @AndreasWenzel I strongly believe it to be a valid character, since an invalid character seems to lead to a compile error. – Julius Jul 18 '20 at 13:05
  • I can't really answer the question but this *may* be helpful: `int n = snprintf(output, 254, "simple %ls text", L"\u00df\U0001d10b");` gives a return value of 12 and puts `simple ßß` in the string (dropping the `text` part). – Adrian Mole Jul 18 '20 at 13:13
  • @jul You'd observe a compiler error only if you allowed the compiler to see the value. The code you're using deliberately makes it impossible for the compiler to see the whole thing. The consequence isn't unusual: the fact that a C++ compiler accepts a program doesn't mean anything. Indeed, it doesn't even mean that the input is a program. – IInspectable Jul 18 '20 at 13:30
  • @AdrianMole That's very interesting indeed! Seems quite weird to me.. – Julius Jul 18 '20 at 13:55
  • @IInspectable Could you elaborate? I am not entirely sure what you mean; doesn't the compiler see the character sequences? I mean, I get a compiler error if I put `\U0011d10b` instead of `\U0001d10b`: `Error C3850 '\U0011D10B': a universal-character-name specifies an invalid character` – Julius Jul 18 '20 at 13:58
  • Well, it is a bug. The underlying issue is that the sprintf implementation converts one character at a time, using wctomb_s(). That function has a bug: by design it cannot properly convert a UTF-16 surrogate, so it should return EILSEQ. It doesn't; it returns 0 and reports -1 bytes copied, which blows the stack. A proper fix would be switching to c16rtomb() and ensuring the [C11 defect report](http://cpp.arh.pub.ro/c/string/multibyte/c16rtomb) is applied. Meanwhile you'll have to do the conversion yourself to sail around the bug. – Hans Passant Jul 18 '20 at 15:47

1 Answer

On Windows, char is 8 bits and wchar_t is 16 bits. To convert between the two, you have to use functions such as WideCharToMultiByte (wide to multi-byte) or MultiByteToWideChar (multi-byte to wide).

When you pass a wide (UTF-16) string to a multi-byte function such as sprintf, the function has to convert it internally. If that conversion goes wrong, it can overflow the buffer, which might be the cause of your crash.
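One way to avoid the conversion inside sprintf is to convert the wide string yourself and pass the result with %s. The following is only a minimal sketch, not a drop-in fix: `utf16_to_utf8` and `format_simple` are hypothetical helpers written for illustration, and the hand-rolled conversion stands in for WideCharToMultiByte so the example does not depend on the Windows API. It assumes the input is UTF-16, as wchar_t strings are on Windows.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not a library function): converts a NUL-terminated
   UTF-16 string to UTF-8. Returns the number of bytes written, excluding the
   terminator, or (size_t)-1 on invalid input or insufficient space. */
static size_t utf16_to_utf8(const uint16_t *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return (size_t)-1;

    while (*src != 0) {
        uint32_t cp = *src++;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: it must be followed by a low surrogate. */
            uint32_t lo = *src;
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            src++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1; /* Unpaired low surrogate. */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out + need >= dst_size)
            return (size_t)-1; /* Keep one byte free for the terminator. */

        if (need == 1) {
            dst[out++] = (char)cp;
        } else if (need == 2) {
            dst[out++] = (char)(0xC0 | (cp >> 6));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[out++] = (char)(0xE0 | (cp >> 12));
            dst[out++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
        } else {
            dst[out++] = (char)(0xF0 | (cp >> 18));
            dst[out++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
        }
    }

    dst[out] = '\0';
    return out;
}

/* Hypothetical helper: converts the wide part first, then formats with %s so
   that snprintf never has to convert wide characters itself. */
static int format_simple(char *output, size_t size, const uint16_t *wide)
{
    char utf8[128];

    if (utf16_to_utf8(wide, utf8, sizeof utf8) == (size_t)-1)
        return -1;
    return snprintf(output, size, "simple %s text", utf8);
}
```

For the string in the question, the UTF-16 code units are 0x00DF, 0xD834, 0xDD0B. Passing those (plus a terminating 0) to `format_simple` with a 255-byte buffer fills it with the UTF-8 encoding of the whole sentence, including the trailing " text".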

Try using swprintf_s instead.
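A rough sketch of what that could look like, assuming a wide output buffer is acceptable for your tests. `format_wide` is a hypothetical helper, and it calls the standard C `swprintf` so the example is not tied to one compiler; with Visual Studio, `swprintf_s` accepts the same arguments in this call.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: formats into a wide buffer, so the wide argument is
   copied as-is and no wide-to-multi-byte conversion takes place.
   count is the buffer size in wchar_t elements, including the terminator.
   Returns the number of wide characters written, or a negative value on
   failure. */
static int format_wide(wchar_t *output, size_t count, const wchar_t *text)
{
    /* With MSVC, swprintf_s(output, count, L"simple %ls text", text)
       has the same shape. */
    return swprintf(output, count, L"simple %ls text", text);
}
```

Called as `format_wide(output, 255, L"\u00df\U0001d10b")` with `wchar_t output[255];`, this keeps everything in wide characters. If you need a char buffer afterwards, the result still has to be converted separately.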

Sammy