
I need to read UTF-8 characters from a text file and process them, for instance to calculate the frequency of occurrence of a certain character. Ordinary characters are fine; the problem occurs with characters like ü or ğ. The following is my code to check whether a certain character occurs, comparing the code of the incoming character:

#include <stdio.h>
#include <wchar.h>

FILE * fin;
FILE * fout;
wint_t c;          /* wint_t (not wchar_t), so the WEOF comparison is reliable */
fin = fopen("input.txt", "r");
fout = fopen("out.txt", "w");
int frequency = 0;
while ((c = fgetwc(fin)) != WEOF)
{
    if (c == SOME_NUMBER) { frequency++; }
}

SOME_NUMBER is what I can't figure out for those characters. In fact, those characters print out 5 different numbers when I try to print them as decimals, whereas for the character 'a', for example, I would write if(c == 97){ frequency++; }, since the ASCII code of 'a' is 97. Is there any way to identify those special characters in C?

P.S. Working with ordinary char (not wchar_t) creates the same problem, but this time printing the decimal equivalent of the incoming character prints 5 different negative numbers for those special characters. The problem stands.

Yu Hao
Ams
  • Most characters take more than one byte. So you should read and compare multiple bytes. Regarding `wchar_t`, I think it's implementation-defined what character encoding functions like `fgetwc` assume, and on many systems it's not UTF-8. –  Nov 14 '14 at 12:42
  • Depending on the font used, several code points may have identical or almost identical glyphs. They are still different code points in unicode. What is the input spectrum and what spectrum (encoding) do you want to map it to? – Klas Lindbäck Nov 14 '14 at 12:45
  • @didierc How would I create such a table? Could you please give me some tips on it in the reply? What should I assign for those special characters in my table? – Ams Nov 14 '14 at 12:57
  • Here's an example how it could be done: http://stackoverflow.com/questions/11156473/is-there-a-way-to-convert-from-utf8-to-iso-8859-1 – Klas Lindbäck Nov 14 '14 at 13:19

4 Answers


A modern C platform should provide everything you need for such a task.

The first thing you have to be sure of is that your program runs under a locale that can handle UTF-8. Your environment should already be set up for that; the only thing you have to do in your code is

setlocale(LC_ALL, "");

to switch from the "C" locale to your native environment.

Then you can read strings as usual with fgets. To do comparisons for accented characters and the like, you'd have to convert such a string to a wide character string (mbsrtowcs), as you already mention. The encoding of such wide characters is implementation-defined, but you don't need to know that encoding to do the checks.

Usually something like L'ä' will work perfectly, as long as the platform on which you compile and the one where you execute are not completely screwed up. If you need codes that you can't even enter on the keyboard, you can use the L'\uXXXX' notation from C11, as didierc mentions in his answer. (L'\uXXXX' is for the "basic" characters; if you have something really weird you'd use L'\UXXXXXXXX', a capital U with 8 hex digits.)

As said, the encoding for wide characters is implementation-defined, but good chances are that it is either UTF-16 or UTF-32, which you can check with sizeof(wchar_t) and the predefined macro __STDC_ISO_10646__. Even if your platform only supports UTF-16 (which may have 2-word "characters"), the use case you describe shouldn't cause any trouble, since all your characters can be coded in the L'\uXXXX' form.

Jens Gustedt
  • Not sure on this, but I believe if `wchar_t` has 16 bits, it cannot represent code points outside the BMP. (In other words, it is UCS-2, not UTF-16, in this case; no two-word characters.) – mafso Nov 14 '14 at 13:52
  • @mafso, both UCS-2 and UTF-16 are possible, but UCS-2 is rare nowadays. Anyhow, the OP only seems to be interested in the BMP, so this shouldn't even matter for him, as for most people. (The value of `__STDC_ISO_10646__` should also be an indication of which of the two applies.) – Jens Gustedt Nov 14 '14 at 13:56
  • UTF-16 is not a possible encoding for `wchar_t`, because the C API for wide characters fundamentally does not admit a conversion from multibyte to wide characters that produces more than one wide character. Only UTF-32 (UCS-4) and UCS-2 are possible. – R.. GitHub STOP HELPING ICE Nov 16 '14 at 15:29
  • @R.., I would think so too, but I am not sure that this isn't done by some, in particular Windows. I vaguely remember that I found some sample code by them that reads UTF-8 byte by byte (with the `mbrtowc` functions or so) to produce a two-word UTF-16, and that they were advertising that their `wchar_t` strings are UTF-16. Anyhow, as long as the characters are in the BMP, everything is fine. – Jens Gustedt Nov 16 '14 at 20:33

You can write your own UTF-8 decoding read function.

See the format description at https://en.wikipedia.org/wiki/UTF-8

This code is not very nice or robust, but it is a sketch of what I meant...

#include <stdio.h>
#include <stdlib.h>

#define INVALID (-2)

int fgetutf8c(FILE* f)
{
    int result = 0;
    int input[6] = {0};

    input[0] = fgetc(f);
    printf("(i[0] = %d) ", input[0]);
    if (input[0] == EOF)
    {
        // The EOF was hit by the first character.
        result = EOF;
    }
    else if (input[0] < 0x80)
    {
        // The first character is the whole (7-bit, single-byte) sequence.
        result = input[0];
    }
    else if ((input[0] & 0xC0) == 0x80)
    {
        // This is not the beginning of a multibyte sequence.
        return INVALID;
    }
    else if ((input[0] & 0xFE) == 0xFE)
    {
        // 0xFE and 0xFF never occur in a valid UTF-8 stream.
        return INVALID;
    }
    else
    {
        int sequence_length;
        // Count the leading 1-bits to get the total length of the sequence.
        for (sequence_length = 1; input[0] & (0x80 >> sequence_length); ++sequence_length);
        // Keep the payload bits of the lead byte (its low 7 - sequence_length bits).
        result = input[0] & (0x7F >> sequence_length);
        printf("sequence length = %d ", sequence_length);
        int index;
        for (index = 1; index < sequence_length; ++index)
        {
            input[index] = fgetc(f);
            printf("(i[%d] = %d) ", index, input[index]);
            if (input[index] == EOF)
            {
                return EOF;
            }
            // Each continuation byte contributes its low 6 bits.
            result = (result << 6) | (input[index] & 0x3F);
        }
    }
    return result;
}

int main(int argc, char **argv)
{
    printf("open(%s) ", argv[1]);
    FILE *f = fopen(argv[1], "r");
    int c = 0;
    while (c != EOF)
    {
        c = fgetutf8c(f);
        printf("* %d\n", c);
    }
    fclose(f);
    return 0;
}
V-X
  • Could you please give me some more advice on that? Perhaps how to start reading from a file and recognizing characters. – Ams Nov 14 '14 at 13:06
  • No, one should certainly not do that; your C library is there for you to do that, don't reinvent the wheel. – Jens Gustedt Nov 14 '14 at 13:24
  • @JensGustedt unless of course someone wants to learn how to invent wheels. Eventually all the inventors will die out, and then where will we be? – unsynchronized Aug 30 '15 at 05:03

This is a proposal for a solution that does not involve wide characters:

From Wikipedia: design of UTF-8 multi-byte sequences

  • The leading "1"s of the 1st byte give the total number of bytes in the sequence
  • "10" at the beginning of a byte signals a continuation byte
  • "0" as the first bit of the 1st byte signals a single-byte sequence

Byte 1    Byte 2    Byte 3    Byte 4
0xxxxxxx
110xxxxx  10xxxxxx
1110xxxx  10xxxxxx  10xxxxxx
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

Therefore you must first determine whether you are positioned on the start of a multi-byte sequence or in the middle of one by testing:

char byte;
// ...
if ((byte & 0xC0) == 0xC0)
{
    // Lead byte of a multi-byte sequence
}
else if ((byte & 0xC0) == 0x80)
{
    // Continuation byte inside a multi-byte sequence
}

Then you have to accumulate the bytes until the sequence is complete (count the leading 1s of the first byte to know how many bytes you need), and finally you will get your unique Unicode code point and can associate a frequency with it.

Note that the string.h API works fine with UTF-8 multi-byte sequences. For example, you can count the occurrences of ü (0xC3 0xBC) in a string str:

const char sequence[] = "\xC3\xBC";   /* "ü" as a NUL-terminated byte string */
size_t count = 0;
for (str = strstr(str, sequence); str != NULL; str = strstr(str + 1, sequence))
{
    count++;
}
n0p

If you need to include wide character literals in your code, you may do that using the following notation:

wchar_t c = L'\u0041'; // 'A'

But I believe that you shouldn't need that if what you want to do is keep frequency stats of characters. The wchar_t type lets you compare values as easily as any other integral type:

wchar_t c1 = L'\u0041', c2 = L'\u0030';
int r = c1 == c2; // 0

With this comparison operator and functions to extract wchar_t from your data stream, you should be able to build an associative table from wchar_t to unsigned int using your input characters only (C hashtable implementations abound on the web).

Perhaps one important point here is that wide chars and UTF-8 chars are different types: the function fgetwc will yield a value of wint_t (a wide integer type), which is an integral type that can hold any wchar_t value (itself 16 or 32 bits wide), while UTF-8 characters may occupy from 1 up to 4 bytes (so 8 to 32 bits) in a plain char *. Since you get wchar_t directly, you actually don't have to worry about the UTF-8 encoding.

Pryftan
didierc
  • I just pushed an edit; I don't even remember what it said (I'm that tired) but I think maybe `win_t` but `fgetwc()` returns a `wint_t` (for wide int type). I added in parentheses the `(wide integer type)` because the edit required more characters than changing the type to be. Please feel free to modify it to your own wording (typically I don't even like to edit another person's work but...). Cheers. – Pryftan Mar 05 '20 at 13:09