
I am processing Unicode strings in C with libunistring, and I can't use another library. My goal is to read a single character from the Unicode string at a given index, print it, and compare it to a fixed value. This should be really simple, but well ...

Here's my try (complete C program):

/* This file must be UTF-8 encoded in order to work */

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>


int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}


int main() {
    setlocale(LC_ALL, "");     /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);

    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];

    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last  : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);

    return 0;
}

In order to run this program:

  1. Save the program to a UTF-8 encoded file called ustridx.c
  2. sudo apt-get install libunistring-dev
  3. gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx -lunistring ustridx.o
  4. Make sure the terminal is set to a UTF-8 locale (locale)
  5. Run it with ./ustridx

Output:

Current locale charset: UTF-8 (should be UTF-8)

foo 楽あり bébé
 - char 0: f
 - char 5: あ
 - last  : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.

The desired behavior is that char 5 and last char are recognized correctly, and printed correctly in the last two lines of the output.

barfuin
  • Are the string literals you check UTF-8 encoded? How does your editor save the file? Have you checked with e.g. a hex editor that the source file is saved in UTF-8? – Some programmer dude Jan 24 '21 at 15:29
  • Yes, the file is correctly UTF-8 encoded, terminal encoding is correct (also UTF-8). Output shown confirms this. I am fairly sure it's not just the editor. The code is wrong. @Someprogrammerdude – barfuin Jan 24 '21 at 15:31
  • And when you used your debugger to inspect the contents of `c0`, what did your debugger show? And when you continued to use your debugger to run your program, one line at a time, what did your debugger show to be happening in your `cmpchr` function? – Sam Varshavchik Jan 24 '21 at 15:31
  • I don't have a debugger, unfortunately. Does the output help us determine this? @SamVarshavchik – barfuin Jan 24 '21 at 15:33
  • Unfortunately, knowing and being able to use a debugger is a required skill for every C++ developer. You cannot expect to be able to develop C++ programs of any moderate complexity without a debugger. The output carries only the information that it shows. In its current form, it doesn't show the results of every logical comparison, and the actual raw values of all variables, and report their values every time they change. This is what a debugger is for. It is certainly possible to add more diagnostic output, and sometimes that's a useful debugging technique on its own. – Sam Varshavchik Jan 24 '21 at 15:34
  • *Last char is recognized as '쎩'* because `쎩` is `U+C3A9` and the UTF-8 encoding of `é` is `0xC3,0xA9`. Flagrant [mojibake](https://en.wikipedia.org/wiki/Mojibake) case… – JosefZ Jan 24 '21 at 15:39
  • Good catch, @JosefZ, thank you! But how to fix? – barfuin Jan 24 '21 at 15:41
  • I actually compiled and debugged this. Based on what I saw in my debugger, I concluded that your use of one of the libunistring's functions is incorrect. Unless you are planning to ask for help on stackoverflow every time your program doesn't work and you can't figure out why, it's necessary to learn how to use a debugger. Doesn't it make more sense to learn how to debug one's own code and fix bugs by yourself, instead of always asking others for help? – Sam Varshavchik Jan 24 '21 at 15:51
  • Right. I needed a debugger to track down what the issue with the comparison is. I never said that it's easy to figure out code issues without a debugger. I said quite the opposite, actually. – Sam Varshavchik Jan 24 '21 at 15:57

2 Answers

From libunistring's documentation for u32_cmp:

 Compares S1 and S2, each of length N, lexicographically.  Returns a
 negative value if S1 compares smaller than S2, a positive value if
 S1 compares larger than S2, or 0 if they compare equal.

The comparison in the if statement was wrong: u32_cmp returns 0 when the two strings compare equal, so that is the value the if statement has to test for. That was the reason for the mismatch. Of course, fixing this reveals other, unrelated issues that also need to be fixed, but it explains the puzzling result of the comparison.

Sam Varshavchik
  • Thanks for spotting this! Fixed in the question so others don't get sidetracked by this. But now of course the comparison is still wrong for the multi-byte characters. – barfuin Jan 24 '21 at 15:57

'あ' and 'é' are invalid character literals. Only characters from the basic source character set and escape sequences are allowed in character literals.

GCC, however, accepts them and only emits a warning (see godbolt): multi-character character constant. That warning is normally about character constants such as 'abc', which are multicharacter literals. It appears here because these characters are encoded as multiple bytes in UTF-8. According to cppreference, the value of such a literal is implementation-defined, so you can't rely on it being the corresponding Unicode code point. GCC specifically does not produce the code point, as seen here.

Since C11 you can use UTF-32 character literals such as U'あ', which yield a char32_t value equal to the Unicode code point of the character. Although by my reading the standard doesn't allow using characters such as あ in literals, the examples on cppreference seem to suggest that it is common for compilers to allow this.
A standard-compliant portable solution is using Unicode escape sequences for the character literal, like U'\u3042' for あ, but this is hardly different from using an integer constant such as 0x3042.

IlCapitano
  • Thanks for getting straight to the core of the problem, and providing many helpful links for further learning. +1 – barfuin Jan 24 '21 at 19:39