
This is my first question here, so feel free to criticize or correct me if I am missing important rules.

Recently I was tasked with porting old DOS C code to a Linux platform. The font handling is realized with bitmap fonts. I wrote a function that is capable of drawing the selected glyph if you pass the correct Unicode value to it.

However, if I try to cast the char to a USHORT (the function expects this type), I get the wrong value whenever the character is outside of the ASCII table.

char* test;
test = "°";

printf("test: %hu\n",(USHORT)test[0]);

The displayed number (console) should be 176 but is instead 194.

If you use "!", the correct value of 33 is displayed. I made sure that char is unsigned by setting the GCC compiler flag

-funsigned-char

GCC uses UTF-8 as its default encoding. I really don't know where the issue is right now.

Do I need to add another flag to the compiler?

Update

With the help of @Kninnug's answer, I managed to write code that produces the desired results for me.

#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <stdint.h>

int main(void)
{
   size_t n = 0, x = 0;
   setlocale(LC_CTYPE, "en_US.utf8");
   mbstate_t state = {0};
   char in[] = "!°水"; // or equivalently u8"!°水"
   size_t in_sz = sizeof(in) / sizeof (*in);

   printf("Processing %zu UTF-8 code units: [ ", in_sz);
   for(n = 0; n < in_sz; ++n)
   {
      printf("%#x ", (unsigned char)in[n]);
   }
   puts("]");

   wchar_t out[in_sz];
   char* p_in = in, *end = in + in_sz;
   wchar_t *p_out = out;
   size_t rc; // mbrtowc returns size_t; (size_t)-1 and (size_t)-2 signal errors
   while((rc = mbrtowc(p_out, p_in, end - p_in, &state)) != 0
         && rc != (size_t)-1 && rc != (size_t)-2)
   {
       p_in += rc;
       p_out += 1;
   }

   size_t out_sz = p_out - out + 1; // converted characters plus the terminating L'\0'
   printf("into %zu wchar_t units: [ ", out_sz);
   for(x = 0; x < out_sz; ++x)
   {
      printf("%u ", (unsigned short)out[x]);
   }
   puts("]");
}

However, when I run this on my embedded device, the non-ASCII characters get merged into one wchar_t, not into two like on my computer.

I could use a single-byte encoding such as cp1252 (this worked fine), but I would like to keep using Unicode.

J.Panek
  • codepoint 176 is encoded in a two-byte UTF-8 sequence: `C2 B0`; you are printing only the first byte of this sequence. – wildplasser Apr 25 '18 at 16:04
  • Never use any non-ASCII characters in program source code. Avoid using `-funsigned-char` in any program that calls standard C library functions. – n. m. could be an AI Apr 25 '18 at 16:24
  • When you say you want ° to be encoded as 176, which character set and encoding do you mean? UTF-8 is a character encoding for the Unicode character set. If you need to match "old DOS", in English, the character set was likely CP437, but there ° is 248. – Tom Blodget Apr 25 '18 at 16:56
  • I am kinda forced to use non-ASCII characters in the program source code since it **will be** legacy code and should not be changed right now. Will the `-funsigned-char` compiler flag break any standard C functions? – J.Panek Apr 26 '18 at 09:23
  • To start, use `wchar_t`, not `char`; and of course any reading of user input and any display should use the wide-character functions. – user3629249 Apr 26 '18 at 13:54

1 Answer


A char (signed or unsigned) is a single byte in C¹. (USHORT)test[0] casts only the first byte in test, but the character in it occupies 2 bytes in the UTF-8 encoding (you can check that with strlen, which counts the number of bytes before the first 0-byte).

To get the proper code point you need to decode the entire UTF-8 sequence. You can do this with mbrtowc and related functions:

setlocale(LC_CTYPE, ""); // mbrtowc is locale-sensitive: without this, the
                         // default "C" locale rejects multibyte sequences

char *test = "°";
size_t len = strlen(test);

wchar_t code = 0;
mbstate_t state = {0};

// convert up to len bytes in test, and put the result in code
// state is used when there are incomplete sequences: pass it to
// the next call to continue decoding
mbrtowc(&code, test, len, &state); // you should check the return value

// here the cast is needed, since a wchar_t is not (necessarily) a short
printf("test: %hu\n", (USHORT)code);

Side notes:

  • If USHORT is 16 bits (as is commonly the case), it is not strictly enough to cover the entire Unicode range, which needs (at least) 21 bits.

  • When you have obtained the proper code point, the cast should not be necessary to pass it to the drawing function. If the function definition or prototype is visible, the compiler can convert the value by itself.


¹ The confusing name comes from the time when all the world's text was English and every ASCII code point fit in a single byte. Hence, a character was the same as a byte.

Kninnug
  • "with mbrtowc" just use `L'...'` wide characters to get directly to the code points, no conversion needed. But using non-ascii stuff in literals is dangerous, one can use `L'\u1234'` notation instead. No wait a minute, why not just `\x1234`? Of course one still needs to convert data loaded from a file. – n. m. could be an AI Apr 25 '18 at 16:39
  • When I use this approach `code` will not change to the value in `test`, instead, it keeps the first assigned value. `mbrtowc` looks promising, but it seems I cannot get this function to work. Can this be a compiler issue of some sorts? However, will I get the Unicode codepoint from this function or something different? – J.Panek Apr 26 '18 at 09:32
  • I forgot to mention that I get -1 as the return value of `mbrtowc` when I use `test = "°"`, but not when `test = "!"` or any other ASCII character. When I declare the char unsigned I can at least extract the two bytes of the encoded "°": `printf("test: %hu\n",(USHORT)test[0])` prints 194 and `printf("test: %hu\n",(USHORT)test[1])` prints 176, which is the correct encoding for the char. – J.Panek Apr 26 '18 at 10:20