Task

At the moment I am porting old DOS code for a device to Linux in pure C. The text is drawn on the surface with the help of bitfonts. I wrote a function which needs the Unicode codepoint to be passed and then draws the corresponding glyph (tested and works with different ASCII and non-ASCII characters). The old source code used DOS encoding but I am trying to use UTF-8 since multilanguage support is desired. I cannot use SDL_ttf or similar functions since the produced glyphs are not "precise" enough. Therefore I have to stick with bitfonts.

Issue

I wrote a small C test program to test the conversion of multibyte characters to their corresponding Unicode codepoint (inspired by http://en.cppreference.com/w/c/string/multibyte/mbrtowc).

#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <stdint.h>

int main(void)
{
   size_t n = 0, x = 0;
   setlocale(LC_CTYPE, "en_US.utf8");
   mbstate_t state = {0};
   char in[] = "!°水"; // or u8"zß水"
   size_t in_sz = sizeof(in) / sizeof (*in);

   printf("Processing %zu UTF-8 code units: [ ", in_sz);
   for(n = 0; n < in_sz; ++n)
   {
      printf("%#x ", (unsigned char)in[n]);
   }
   puts("]");

   wchar_t out[in_sz];
   char* p_in = in, *end = in + in_sz;
   wchar_t *p_out = out;
   int rc = 0;
   while((rc = mbrtowc(p_out, p_in, end - p_in, &state)) > 0)
   {
       p_in += rc;
       p_out += 1;
   }

   size_t out_sz = p_out - out + 1;
   printf("into %zu wchar_t units: [ ", out_sz);
   for(x = 0; x < out_sz; ++x)
   {
      printf("%u ", (unsigned short)out[x]);
   }
   puts("]");
}

On my Linux computer the output is as expected:

Processing 7 UTF-8 code units: [ 0x21 0xc2 0xb0 0xe6 0xb0 0xb4 0 ]
into 4 wchar_t units: [ 33 176 27700 0 ]

When I run this code on my embedded Linux device I get the following as output:

Processing 7 UTF-8 code units: [ 0x21 0xc2 0xb0 0xe6 0xb0 0xb4 0 ]
into 2 wchar_t units: [ 33 55264 ]

After the ! character, mbrtowc returns -1, which according to the documentation indicates an encoding error. I tested different characters, and the error occurs only with non-ASCII ones. It never occurred on my Linux computer.

Additional Information

I am using a PFM-540I Rev. B as the PC in the embedded device. The Linux distribution is built using Buildroot.

J.Panek
  • Hmmm, embedded Linux device 2nd output is hex yet the expected output is decimal. Suggest hex `"%x "` for both to improve post clarity. Also recommend to review `rc` in each iteration of `while((rc ...` to see if it is not an unexpected value. – chux - Reinstate Monica May 02 '18 at 14:09
  • My bad, I tried different outputs and didn't pay attention what I post. Corrected the output so it now shows the decimal value. At first `rc = 1` but after the first character is processed and the `°` is next it changes to `rc = -1`. This stops the while loop because the encoding error has occurred. Hope this clarifies things a bit – J.Panek May 02 '18 at 14:22
  • Did `setlocale(LC_CTYPE, "en_US.utf8");` succeed on the _embedded Linux device_? (return a string or null pointer?) – chux - Reinstate Monica May 02 '18 at 14:27
  • Yes, the output is `(null)`. – J.Panek May 02 '18 at 14:32
  • If the _embedded Linux device_ returns `null`, then "en_US.utf8" is not supported there, and `mbrtowc()` is expected to return -1 for non-ASCII input. – chux - Reinstate Monica May 02 '18 at 14:33
  • So how can I change my locale (or other system settings) so that `mbrtowc()` expects UTF-8 characters? Encodings outside ASCII are pretty new to me. Edit: locale is not supported on the embedded device, I will add it first and report results. – J.Panek May 02 '18 at 14:41

1 Answer

You need to make sure that the en_US.utf8 locale is available on the embedded Linux build. By default, Buildroot limits the locales installed on the system in two ways:

  • Only specific locales are generated, as specified by the BR2_GENERATE_LOCALE configure option. By default, this list is empty, so you only get the C locale. Set this config option to en_US.UTF-8.
  • All locale data is removed at the end of the build, except for the locales listed in BR2_ENABLE_LOCALE_WHITELIST. en_US is already in the default value, so you probably don't need to change this.
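
As a rough illustration of the two options above, the relevant lines in the Buildroot .config might look like the fragment below. The whitelist value shown is an assumption about the default; keep whatever your configuration already has, as long as it still contains en_US.

```
BR2_GENERATE_LOCALE="en_US.UTF-8"
BR2_ENABLE_LOCALE_WHITELIST="C en_US"
```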

Note that if you change these configuration options, you need to make a completely clean build (with make clean; make) for the change to take effect.
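
To confirm on the target that the locale is really there, it can help to check the return value of setlocale() before converting anything, and to compare the mbrtowc() result against (size_t)-1 and (size_t)-2 explicitly. The following is only a minimal sketch: the helper name decode_multibyte and its return codes are made up for illustration and are not part of the question's code.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (name and return codes are assumptions):
 * selects `locale_name` for LC_CTYPE, then converts the multibyte
 * string `in` into at most `cap` wide characters stored in `out`.
 * Returns the number of characters converted, -1 if the locale is
 * not available, or -2 on an invalid or incomplete sequence. */
long decode_multibyte(const char *locale_name, const char *in,
                      wchar_t *out, size_t cap)
{
    /* setlocale() returns a null pointer when the locale is missing */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    size_t remaining = strlen(in);
    size_t count = 0;

    while (remaining > 0 && count < cap) {
        size_t rc = mbrtowc(&out[count], in, remaining, &state);
        if (rc == (size_t)-1 || rc == (size_t)-2)
            return -2;                 /* encoding error or truncated input */
        if (rc == 0)
            break;                     /* null character reached */
        in += rc;
        remaining -= rc;
        ++count;
    }
    return (long)count;
}
```

Calling decode_multibyte("en_US.UTF-8", "!°水", buf, 8) on the device and printing the result separates the two failure modes: -1 means the locale was never installed (the null pointer from setlocale() mentioned in the comments), while -2 means the locale loaded but the input bytes were rejected.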

Arnout
  • Worked fine, but had to generate `en_US.UTF-8` locale instead of `en_US.utf8`. The `en_US.utf8` locale resulted in an error while building (not found). Thank you. – J.Panek May 07 '18 at 08:23
  • OK, edited. When generating `en_US.UTF-8`, the `en_US.utf8` locale also becomes available, right? – Arnout May 08 '18 at 10:47
  • Yes, just verified it. Both locales become available. – J.Panek May 08 '18 at 16:03