I've got a text file, foo.txt, with these contents:

R⁸2

I had a large program reading it and doing things with each character, but it always received EOF when it hit the ⁸. Here are the relevant portions of the code:

setlocale(LC_ALL,"");

FILE *in = fopen(argv[1],"r");

while (1) {
    wint_t c = getwc(in);
    printf("%d ",wctob(c));

    if (c == -1)
        printf("Error %d: %s\n",errno,strerror(errno));

    if (c == WEOF)
        return 0;
}

It prints 82 -1 (the ASCII code for R, then EOF). No matter where I put the ⁸ in the file, it always reads as EOF. Edit: I added a check for errno and it gives this:

Error 84: Invalid or incomplete multibyte or wide character

However, ⁸ is Unicode U+2078 'SUPERSCRIPT EIGHT'. I wrote it to foo.txt via cat and copy-pasting from fileformat.info. A hexdump of foo.txt shows:

0000000: 52e2 81b8 32                             R...2

What's the problem?

MD XF
  • 7,860
  • 7
  • 40
  • 71
  • 3
    You need to check for `WEOF` instead of `EOF`, also change `int` to `wint_t`. Take a look at the documentation: http://www.cplusplus.com/reference/cwchar/getwc/ – David Ranieri Aug 11 '17 at 18:14
  • 2
    And the [docs](https://msdn.microsoft.com/en-us/library/0k477bzh.aspx) for `int wctob(wint_t wchar);` too. *If `wctob` successfully converts a wide character, it returns its multibyte character representation, only if the multibyte character is exactly one byte long. If `wctob` encounters a wide character it cannot convert to a multibyte character or the multibyte character is not exactly one byte long, it returns a `–1`.* – Weather Vane Aug 11 '17 at 18:20
  • @KeineLust done, updated. – MD XF Aug 11 '17 at 18:23
  • Which OS and C compiler are you using? I think the combination of (OS, compiler, locale) is the issue. For example, your code runs fine on macOS using clang (with the issue mentioned by @WeatherVane that it will print -1 for the '⁸' character), that is, the output is `82 -1 50 10 -1`. – idz Aug 11 '17 at 19:11
  • @idz Raspbian PIXEL 4.9.2, gcc 4.9, locale cleared using `setlocale(LC_ALL,"");`. – MD XF Aug 11 '17 at 19:13
  • Looks like your input file is UTF-8, but your default locale is not. What have you set `LC_CTYPE` and `LANG` environment variables to? – Chris Dodd Aug 11 '17 at 19:17
  • @ChrisDodd `locale`'s output is `LANG=C ... LC_CTYPE="C"`, the `...` denoting extraneous information. – MD XF Aug 11 '17 at 19:19
  • 2
    Try using `C.UTF-8` or some other UTF-8 locale. What does `locale -a` tell you are supported locales on your system? – Chris Dodd Aug 11 '17 at 19:20
  • 1
    @ChrisDodd Fixed! If you post that as an answer I'll accept. Also, do you know why `setlocale(LC_ALL,"");` didn't fix the problem? I was under the assumption that would portably set it to a codepage compatible with Unicode. – MD XF Aug 11 '17 at 19:22
  • @MDXF `setlocale(LC_ALL,"");` is a way to set the default locale. It existed before Unicode and so is not specified to be Unicode aware. – chux - Reinstate Monica Aug 11 '17 at 20:35

1 Answer

1. Check for WEOF instead of EOF

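EOF is meant for single-byte characters. WEOF is for wide characters. getwc never returns EOF: it returns WEOF both at end of file and when a read or encoding error occurs (for an encoding error, errno is set to EILSEQ). Compare its result against WEOF, then use feof, ferror or errno to tell the cases apart.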

In glibc's stdio.h:

#define EOF (-1)

In glibc's wchar.h:

#define WEOF (0xffffffffu)

2. Set the locale to one supporting Unicode

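The default locale of a C program is C, also called POSIX, which is only guaranteed to handle the basic (ASCII) character set. setlocale(LC_ALL,"") does not select a Unicode locale by itself: it selects whatever the environment (LC_ALL, LC_CTYPE, LANG) asks for, which in this case was still C. It is therefore sometimes necessary to name a UTF-8 locale explicitly. C.UTF-8 is available on Debian-based systems such as Raspbian, but it is not guaranteed to exist everywhere; locale -a lists the locales that are installed.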

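/* LC_ALL includes LC_CTYPE, so one call is enough; NULL means the locale is not installed. */
if (setlocale(LC_ALL,"C.UTF-8") == NULL)
    fprintf(stderr,"C.UTF-8 is not available\n");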

3. Use the proper type for wide characters

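The return value of getwc isn't char, int or even wchar_t; it's wint_t. Make sure that your character variable c is of type wint_t so that the comparison with WEOF is reliable: wint_t can hold every wchar_t value plus WEOF, whereas wchar_t is not required to be able to hold WEOF.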

MD XF