
If I have c1 and c2 as char variables (such that c1 followed by c2 is the byte sequence of a UTF-8 character), how do I create and print the UTF-8 character?

Similarly for the 3 and 4 byte UTF-8 characters?

I've been trying all kinds of approaches with mbstowcs() but I just can't get it to work.

SO Stinks

2 Answers


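I managed to write a working example.
When c1 is '\xce' and c2 is '\xb8', the result is θ.
It turns out that setlocale has to be called before mbstowcs: a C program starts in the "C" locale, and mbstowcs decodes bytes according to the LC_CTYPE category of the current locale, so a UTF-8 locale must be active first.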

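#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void)
{
   /* mbstowcs decodes according to LC_CTYPE, so select a UTF-8 locale first. */
   const char *localeInfo = setlocale(LC_ALL, "en_US.utf8");
   if (localeInfo == NULL) {
       fprintf(stderr, "Locale en_US.utf8 is not available\n");
       return EXIT_FAILURE;
   }
   printf("Locale information set to %s\n", localeInfo);

   const char c1 = '\xce';
   const char c2 = '\xb8';
   const size_t byteCount = 2;

   char *mbS = malloc(byteCount + 1);
   if (mbS == NULL) {
       return EXIT_FAILURE;
   }
   mbS[0] = c1;
   mbS[1] = c2;
   mbS[byteCount] = '\0'; /* null terminator */
   printf("Directly using printf: %s\n", mbS);

   /* With a NULL destination, mbstowcs only counts the wide characters needed. */
   size_t requiredSize = mbstowcs(NULL, mbS, 0);
   if (requiredSize == (size_t)-1) {
       fprintf(stderr, "Invalid multibyte sequence\n");
       free(mbS);
       return EXIT_FAILURE;
   }
   printf("Output size including null terminator is %zu\n\n", requiredSize + 1);

   wchar_t *wideOutput = malloc((requiredSize + 1) * sizeof *wideOutput);
   if (wideOutput == NULL) {
       free(mbS);
       return EXIT_FAILURE;
   }

   /* mbstowcs returns size_t; failure is reported as (size_t)-1. */
   size_t len = mbstowcs(wideOutput, mbS, requiredSize + 1);
   if (len == (size_t)-1) {
       printf("Failed conversion!\n");
   } else {
       printf("Converted %zu character(s). Result: %ls\n", len, wideOutput);
   }

   free(wideOutput);
   free(mbS);
   return 0;
}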

Output:

Locale information set to en_US.utf8
Directly using printf: θ
Output size including null terminator is 2

Converted 1 character(s). Result: θ

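For 3- or 4-byte UTF-8 characters, the same approach works: set byteCount to 3 or 4 and fill mbS with all of the bytes before the null terminator. Note that where wchar_t is only 16 bits wide (as on Windows), a 4-byte character does not fit in a single wchar_t.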

qrsngky

If I have c1 and c2 as char variables (such that c1 followed by c2 is the byte sequence of a UTF-8 character), how do I create and print the UTF-8 character?

Together they already are a UTF-8 character. You would just print them.

putchar(c1);
putchar(c2);

It's up to your terminal, or whatever device displays the output, to understand and render the UTF-8 encoding properly. This is unrelated to the encoding used by your program and unrelated to wide characters.

Similarly for the 3 and 4 byte UTF-8 characters?

You would output them.


If your terminal or the device you are sending the bytes to does not understand UTF-8, then you have to convert the bytes to an encoding it does understand. Typically you would use an external library for that, such as iconv. Alternatively, you could call setlocale(LC_ALL, "C.UTF-8"), convert your bytes to wchar_t, then call setlocale again with a locale that uses your target encoding, and either convert the wide characters to that encoding or print them with %ls. All %ls does (on common systems) is convert the wide string back to multibyte in the current locale's encoding and output it. A wide stream writing to a terminal does the same: it converts first, then outputs.

KamilCuk