In C, how to print UTF-8 char if given its bytes in char variables?

Question

If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?

Similarly for the 3 and 4 byte UTF-8 characters?

I've been trying all kinds of approaches with mbstowcs() but I just can't get it to work.

My recent answer on utf-8 may help: [Searching letters in the two dimensional array in C](https://stackoverflow.com/a/73887619/5382650) — Craig Estey, Oct 01 '22 at 13:02
In general with UTF-8: "char" is just a wrong name for "byte". A real Unicode char should be represented by a string. MB doesn't usually help. — Giacomo Catenazzi, Oct 03 '22 at 08:20

qrsngky · Answer 1 · 2022-10-04T04:35:00.033

I managed to write a working example.
When c1 is '\xce' and c2 is '\xb8', the result is θ.
It turns out that I have to call setlocale before using mbstowcs.

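#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void)
{
   // mbstowcs() decodes according to the current locale, so select a UTF-8 one first.
   const char *localeInfo = setlocale(LC_ALL, "en_US.utf8");
   if (localeInfo == NULL) {
       fprintf(stderr, "Locale en_US.utf8 is not available\n");
       return EXIT_FAILURE;
   }
   printf("Locale information set to %s\n", localeInfo);

   const char c1 = '\xce';
   const char c2 = '\xb8';
   size_t byteCount = 2;

   char *mbS = malloc(byteCount + 1);
   if (mbS == NULL) {
       return EXIT_FAILURE;
   }
   mbS[0] = c1;
   mbS[1] = c2;
   mbS[byteCount] = 0; // null terminator
   printf("Directly using printf: %s\n", mbS);

   // With a NULL destination (POSIX behaviour), mbstowcs returns the number of
   // wide characters needed, excluding the terminator, or (size_t)-1 on error.
   size_t requiredSize = mbstowcs(NULL, mbS, 0);
   if (requiredSize == (size_t)-1) {
       printf("Failed conversion!\n");
       free(mbS);
       return EXIT_FAILURE;
   }
   printf("Output size including null terminator is %zu\n\n", requiredSize + 1);

   wchar_t *wideOutput = malloc((requiredSize + 1) * sizeof(wchar_t));
   if (wideOutput == NULL) {
       free(mbS);
       return EXIT_FAILURE;
   }

   size_t len = mbstowcs(wideOutput, mbS, requiredSize + 1);
   if (len == (size_t)-1) {
       printf("Failed conversion!\n");
   } else {
       printf("Converted %zu character(s). Result: %ls\n", len, wideOutput);
   }

   free(wideOutput);
   free(mbS);
   return 0;
}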

Output:

Locale information set to en_US.utf8
Directly using printf: θ
Output size including null terminator is 2

Converted 1 character(s). Result: θ

For 3- or 4-byte UTF-8 characters, the same approach works: store the extra bytes in mbS and set byteCount to 3 or 4.
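As a rough sketch of how that might look for longer sequences: the helper names pack_utf8, bytes_to_wide and demo are hypothetical (they are not part of the answer), and the euro sign and emoji bytes are just example values I picked. The conversion still assumes that a UTF-8 locale such as en_US.utf8 is installed and has been selected with setlocale.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: copy n (1..4) UTF-8 bytes into buf and
   NUL-terminate it so it can be used as an ordinary C string. */
void pack_utf8(char buf[5], const char *bytes, size_t n)
{
    assert(n >= 1 && n <= 4);
    memcpy(buf, bytes, n);
    buf[n] = '\0';
}

/* Hypothetical helper: convert one byte sequence to wide characters.
   Returns the number of wide characters written, or (size_t)-1 if the
   bytes are not valid in the current locale's encoding. */
size_t bytes_to_wide(const char *bytes, size_t n, wchar_t out[5])
{
    char buf[5];
    pack_utf8(buf, bytes, n);
    return mbstowcs(out, buf, 5);
}

/* Example values (assumptions): U+20AC is 3 bytes, U+1F600 is 4 bytes. */
int demo(void)
{
    const char euro[]  = { '\xe2', '\x82', '\xac' };
    const char emoji[] = { '\xf0', '\x9f', '\x98', '\x80' };
    wchar_t wide[5];

    if (setlocale(LC_ALL, "en_US.utf8") == NULL)
        return -1;                       /* locale not installed */

    if (bytes_to_wide(euro, sizeof euro, wide) != (size_t)-1)
        printf("3 bytes: %ls\n", wide);
    if (bytes_to_wide(emoji, sizeof emoji, wide) != (size_t)-1)
        printf("4 bytes: %ls\n", wide);
    return 0;
}
```

Calling demo() from main should print both characters on a UTF-8 terminal. The only change from the two-byte case is the number of bytes copied before the terminator.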

KamilCuk · Answer 2 · 2022-10-04T05:56:41.493

If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?

Together they already are a UTF-8 character. You would just print them.

putchar(c1);
putchar(c2);

It's up to your terminal, or whatever device you are using to display the output, to properly understand and render the UTF-8 encoding. This is unrelated to the encoding used by your program and unrelated to wide characters.

Similarly for the 3 and 4 byte UTF-8 characters?

You would output them.
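A minimal sketch of what that could look like, assuming the terminal expects UTF-8. The function names write_utf8_bytes and print_examples are hypothetical, and the bytes for the euro sign and the emoji are example values rather than anything from the question.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write the n raw bytes of one UTF-8 sequence.
   Returns 0 on success, -1 if fewer than n bytes were written. */
int write_utf8_bytes(FILE *out, const char *bytes, size_t n)
{
    assert(out != NULL && bytes != NULL);
    return fwrite(bytes, 1, n, out) == n ? 0 : -1;
}

void print_examples(void)
{
    /* Three separate char variables: U+20AC EURO SIGN. */
    const char c1 = '\xe2', c2 = '\x82', c3 = '\xac';
    putchar(c1);
    putchar(c2);
    putchar(c3);
    putchar('\n');

    /* Four bytes in an array: U+1F600 GRINNING FACE. */
    const char four[] = { '\xf0', '\x9f', '\x98', '\x80' };
    write_utf8_bytes(stdout, four, sizeof four);
    putchar('\n');
}
```

No locale setup is involved here, since the program only passes the bytes through unchanged.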

If your terminal or the device you are sending the bytes to does not understand UTF-8 encoding, then you have to convert the bytes to something the device understands. Typically, you would use an external library for that, like iconv. Alternatively, you could call setlocale(LC_ALL, "C.utf-8"), convert your bytes to wchar_t, then call setlocale(LC_ALL, "C.your_target_encoding") and either convert the wide string to that encoding or output it with %ls. All %ls does (on common systems) is convert the wide string back to multibyte and then output it. A wide stream writing to a terminal does the same: it first converts, then outputs.
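One possible shape for the iconv route, as a sketch only: the wrapper name utf8_to_encoding and its parameters are hypothetical, and which target encodings are available depends on the iconv implementation installed on the system. The calls used are the POSIX iconv_open, iconv and iconv_close.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert inLen bytes of UTF-8 into toCode.
   On success returns 0 and stores the number of output bytes in *outLen.
   Returns -1 if the conversion is unavailable or the input cannot be
   converted (invalid bytes, or output buffer too small). */
int utf8_to_encoding(const char *toCode, const char *in, size_t inLen,
                     char *out, size_t outCap, size_t *outLen)
{
    assert(in != NULL && out != NULL && outLen != NULL);

    iconv_t cd = iconv_open(toCode, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                   /* this conversion pair is not supported */

    char *src = (char *)in;          /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t srcLeft = inLen;
    size_t dstLeft = outCap;

    size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dstLeft);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *outLen = outCap - dstLeft;
    return 0;
}
```

For example, converting the bytes 0xCE 0xB8 (θ) with toCode set to "ISO-8859-7" should yield the single byte 0xE8 where that encoding is supported. The resulting bytes can then be written with fwrite or putchar as before.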
