
The following C code reads lines from stdin using fgetws() and writes them to stdout.

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

#define STR_LEN 128

int main(int argc, char **argv)
{
    FILE *infile = stdin, *outfile = stdout;
    wchar_t str[STR_LEN];

    if (setlocale(LC_ALL, "en.UTF-8") == NULL) {
        fprintf(stderr, "Cannot set locale\n");
        return 1;
    }

    for (;;) {
        if (!fgetws(str, STR_LEN, infile)) {
            if (feof(infile)) {
                break;
            }
            perror("fgetws()");
            continue;
        }
        str[wcscspn(str, L"\r\n")] = L'\0';   /* strip the trailing newline */

        if (fwprintf(outfile, L"%ls\n", str) < 0) {
            perror("fwprintf()");
        }

    }

    return 0;
}

It always works perfectly with ASCII files, but when reading UTF-8 data fgetws() sometimes fails with EILSEQ (Illegal byte sequence), and I cannot figure out why.

In the output, the line that causes the error is truncated: some characters are missing and the rest ends up on the next line. The strange thing is that if I feed the program only that line, I get no error.

For example, if I read a file with just a few UTF-8 lines, it's fine; if I repeat the same lines many times, I get EILSEQ.

I'm almost sure that the file is correctly encoded.

I use Linux with musl-libc.

What's wrong with my code?

EDIT: I get several EILSEQ errors, depending on the input size, but I don't know the exact relationship between the two.

With the same input I get the same errors at the same lines.

It does not seem to be a specific offset or character that triggers the error, but I might be wrong.

EDIT 2: I also tested the code on OpenBSD, and there it works. At this point I suspect the issue is related to Linux or musl-libc.

Rat Salad

1 Answer


UTF-8 stores text in plain char arrays, the same way as ANSI; the only difference is that a single linguistic character can occupy more than one byte.
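
A minimal illustration of that point (the string literal is my own example, spelled with hex escapes so it works regardless of source encoding): strlen() counts bytes, not linguistic characters, yet the text fits in an ordinary char array:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "caf\xC3\xA9";   /* "café": the final character is the two-byte UTF-8 sequence 0xC3 0xA9 */

    printf("\"%s\" is %zu bytes but 4 characters\n", s, strlen(s));   /* strlen() returns 5 */
    return 0;
}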

wchar_t and the wide-string functions are used for UTF-16 on Windows. On Linux you would use char16_t to store UTF-16, but only if you were actually working with UTF-16 files; that's clearly not the case here.

Just use the char functions to handle UTF-8, exactly the same way you work with ANSI:

char str[STR_LEN];
while (fgets(str, STR_LEN, infile))
{
    str[strcspn(str, "\r\n")] = '\0';   /* strip the newline; strcspn() is in <string.h> */
    fprintf(outfile, "%s\n", str);
}
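
Searching for ASCII delimiters in the char buffer is safe for the same reason: in UTF-8, the bytes 0x00-0x7F never occur inside a multi-byte sequence, so strchr() cannot match in the middle of a character. A minimal sketch, with a made-up input string:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *line = "ελλ,ηνικά";          /* UTF-8 text with an ASCII delimiter */
    const char *comma = strchr(line, ',');   /* safe: ',' never appears inside a multi-byte char */

    if (comma)
        printf("comma at byte offset %td\n", comma - line);   /* prints 6, not 3 */
    return 0;
}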
Barmak Shemirani
  • Yes, the above program will work even with `char` strings. But if I need to manipulate single characters after reading, then I will need `wchar_t`, right? I'd prefer not to use external libraries (e.g. ICU), because I only need to do simple things. – Rat Salad Nov 19 '17 at 09:07
  • Aren't you using Linux? `wchar_t` is used on Windows; it's rarely used on Linux. Just use `char`. You can parse the text if you are just searching for ASCII characters like `'\n'` or `','`, because those bytes never occur inside a multi-byte sequence. Treat it as regular ANSI text. But if you have something like `"ελληνικά"` then it's very difficult to find `'η'`, because `'η'` is encoded as several bytes; you would rarely need that option. – Barmak Shemirani Nov 19 '17 at 09:16
  • Unfortunately I need to perform operations on non-ASCII characters. – Rat Salad Nov 19 '17 at 09:22
  • Okay, then you may need external libraries. Your console input is UTF-8. You have to convert it to UTF-32 and store it in `char32_t` (or `wchar_t`); each character then occupies 4 bytes. That's if you want to find the position of `'η'` in `"ελληνικά"` (see the sketch below). Note however that it's easy to find the position of `','` in `"ελλ,ηνικά"` (it won't be at byte index 3, but you just walk the string until you find it). – Barmak Shemirani Nov 19 '17 at 09:28
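
A minimal sketch of the conversion described in the last comment, using only standard C (mbstowcs() from <stdlib.h>); it assumes the process runs under a UTF-8 locale and that the source file is saved as UTF-8:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    wchar_t wide[32];

    /* Take the locale from the environment; assumed to be a UTF-8 locale. */
    setlocale(LC_ALL, "");

    /* Decode the UTF-8 bytes into one code point per wchar_t element. */
    if (mbstowcs(wide, "ελληνικά", 32) == (size_t)-1) {
        perror("mbstowcs()");
        return 1;
    }

    /* Now L'η' is a single array element and can be located directly. */
    wchar_t *p = wcschr(wide, L'η');
    if (p)
        printf("found at character index %td\n", p - wide);   /* prints 3 */
    return 0;
}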