The following C code reads lines from stdin using fgetws() and writes them to stdout.
    #include <stdio.h>
    #include <locale.h>
    #include <wchar.h>

    #define STR_LEN 128

    int main(int argc, char **argv)
    {
        FILE *infile = stdin, *outfile = stdout;
        wchar_t str[STR_LEN];

        if (setlocale(LC_ALL, "en.UTF-8") == NULL) {
            fprintf(stderr, "Cannot set locale\n");
            return 1;
        }
        for (;;) {
            if (!fgetws(str, STR_LEN, infile)) {
                if (feof(infile)) {
                    break;
                }
                perror("fgetws()");
                continue;
            }
            str[wcscspn(str, L"\r\n")] = L'\0';
            if (fwprintf(outfile, L"%ls\n", str) < 0) {
                perror("fwprintf()");
            }
        }
        return 0;
    }
It always works with ASCII files, but with UTF-8 data fgetws() sometimes fails with EILSEQ (Illegal byte sequence), and I cannot figure out why.
In the output, the line that triggers the error is truncated: some characters are missing and the rest appears on the next line. The strange thing is that if I feed only that line to the program, I get no error.
For example, reading a file with just a few UTF-8 lines works fine; if I repeat the same lines many times, I get EILSEQ.
I'm almost sure that the file is correctly encoded.
I use Linux with musl-libc.
What's wrong with my code?
EDIT:
I get several EILSEQ errors, the number depending on the input size, but I don't know the exact relationship between the two.
With the same input I get the same errors at the same lines.
It does not seem to be a specific offset or character that triggers the error, but I might be wrong.
EDIT 2: I tested the code on OpenBSD as well, and there it works. At this point I suspect the issue is related to Linux or musl-libc.