C - How to avoid diacritic/accents sensitive issues

Question

I'm writing a small program for guessing the capitals of countries. Some of the capitals have accents, cedillas, etc.

Since I have to compare the capital with the text the user guessed, and I don't want an accent to break the comparison, I searched the internet for a way to do that.

I came across countless solutions for other programming languages, but only a couple of results for C.

None of them actually worked for me. However, I came to the conclusion that I'd have to use the wchar.h library to deal with those annoying characters.

I wrote this small piece of code (which replaces É with E) just to test the approach, and contrary to everything I've read and understood, it doesn't work: even printing the wide-character string doesn't show the diacritic characters. If it worked, I'm sure I could apply it to the capitals program, so I'd appreciate it if someone could tell me what's wrong.

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

const wchar_t CAPITAL_ACUTE_E = L'\u00C9';

int main(void)
{
    wchar_t wbuff[128];
    setlocale(LC_ALL, "");  /* use the locale (and encoding) of the environment */
    fputws(L"Say something: ", stdout);
    if (fgetws(wbuff, 128, stdin) == NULL)  /* stop on EOF or a conversion error */
        return 1;
    size_t n;
    size_t len = wcslen(wbuff);
    for (n = 0; n < len; n++)
        if (wbuff[n] == CAPITAL_ACUTE_E)
            wbuff[n] = L'E';
    wprintf(L"%ls\n", wbuff);
    return 0;
}

That's a problematic subject in standard C. First clarify which input encoding your platform uses, then take appropriate measures. — too honest for this site, Jul 17 '16 at 21:08
As @Olaf said: you need to know the input encoding. Your example works well with `LANG=en_US.UTF-8` in bash (I copied and pasted your line "which replaces É with E" as the input). You already use `setlocale(3)`; just read its output and act accordingly (the hardest part, if you ask me). — deamentiaemundi, Jul 17 '16 at 21:56
With `char`, I have used `tolower(toupper(ch))` to fold and fold again letters that are "alike". Perhaps a `wchar_t` equivalent? Maybe `towctrans()`? — chux - Reinstate Monica, Jul 18 '16 at 00:03

a3f · Answer 1 · 2016-07-17T23:24:23.077

An issue you overlooked is that É can be represented as

É - LATIN CAPITAL LETTER E WITH ACUTE, codepoint U+00C9 (c3 89 in UTF-8), or
É - LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT, codepoints U+0045 U+0301 (45 cc 81 in UTF-8)

You need to account for this. This can be done by mapping both strings to NFD (Normalization Form D, canonical decomposition), which turns the precomposed É into E followed by the combining accent. After that, you can strip away the combining characters and be left with the plain E, which you can then compare with strcmp as usual.
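To make the mismatch concrete, here is a minimal sketch using only the standard library. It just spells out the two byte sequences listed above and shows that a byte-wise comparison treats them as different strings. The names `E_ACUTE_PRECOMPOSED`, `E_ACUTE_DECOMPOSED` and `same_bytes` are made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* The two UTF-8 spellings of the same visible letter. */
static const char E_ACUTE_PRECOMPOSED[] = "\xC3\x89";   /* U+00C9 */
static const char E_ACUTE_DECOMPOSED[]  = "E\xCC\x81";  /* U+0045 U+0301 */

/* Hypothetical helper: byte-wise equality, which is all strcmp can offer. */
static int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Here `same_bytes(E_ACUTE_PRECOMPOSED, E_ACUTE_DECOMPOSED)` is false even though both render as the same letter, and neither one equals a plain "E". That is the gap normalization has to close.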

Assuming you've got UTF-8 encoded input, here is how you could do the decomposition and stripping with utf8proc:

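#include <stdlib.h>
#include <utf8proc.h>

/* input is assumed to be a NUL-terminated, UTF-8 encoded string */
utf8proc_uint8_t *output;
utf8proc_ssize_t len = utf8proc_map((const utf8proc_uint8_t *)input, 0, &output,
                           UTF8PROC_NULLTERM | UTF8PROC_STABLE |
                           UTF8PROC_STRIPMARK | UTF8PROC_DECOMPOSE |
                           UTF8PROC_CASEFOLD
                          );
/* len < 0 signals an error; otherwise output was malloc'd and must be free()d */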

This would turn É (in either representation) and E into a plain e. The result is lowercase because UTF8PROC_CASEFOLD also folds case.
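If pulling in a library is not an option, the stripping step on its own can be sketched by hand. The following is only an illustration and not what utf8proc does internally. The helper names `is_combining_mark` and `strip_combining_marks` are made up for this example. It assumes valid UTF-8 input that is already decomposed, and it only knows the Combining Diacritical Marks block (U+0300 to U+036F), which UTF-8 encodes as the byte pairs CC 80 through CD AF. A precomposed É (c3 89) passes through unchanged, so decomposition would still need a table or a library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the two bytes at p encode a code point
 * in U+0300..U+036F. Reading p[1] is safe for a NUL-terminated string,
 * because p[0] is known to be non-NUL whenever p[1] is inspected. */
static int is_combining_mark(const unsigned char *p)
{
    if (p[0] == 0xCC)
        return p[1] >= 0x80 && p[1] <= 0xBF;   /* U+0300..U+033F */
    if (p[0] == 0xCD)
        return p[1] >= 0x80 && p[1] <= 0xAF;   /* U+0340..U+036F */
    return 0;
}

/* Hypothetical helper: copies src to dst while dropping combining marks.
 * dst must have room for strlen(src) + 1 bytes. Returns the new length. */
static size_t strip_combining_marks(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    const unsigned char *in = (const unsigned char *)src;
    size_t out = 0;
    while (*in != '\0') {
        if (is_combining_mark(in))
            in += 2;                    /* skip the two-byte mark */
        else
            dst[out++] = (char)*in++;   /* keep every other byte as-is */
    }
    dst[out] = '\0';
    return out;
}
```

With this, calling `strip_combining_marks(buf, "E\xCC\x81")` leaves "E" in `buf`, which can then be compared with strcmp. Case folding is not handled here, so the comparison would still be case-sensitive.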
