Don't ignore whitespaces when using wscanf for UTF-8

Question

I am trying to read wide characters into an array of wchar_t from stdin. However, the negated scanset specifier ([^characters]) combined with ls does not work as expected.

The goal is to have every whitespace character read into str instead of being skipped. I therefore tried [^\n], but with no luck: the program loops endlessly and keeps printing garbled text to stdout.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>

int main(void)
{
    wchar_t str[8];

    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)  {
        fprintf(stderr, "Failed to set locale LC_ALL = en_US.UTF-8.\n");
        exit(EXIT_FAILURE);
    }

    // correct (but not what I want)
    // whitespaces and EOLs are ignored
    // while (wscanf(L"%7ls", str) != EOF)  {
    //     wprintf(L"%ls", str);
    // }

    // incorrect
    // whitespaces (except EOLs) are properly read into str (what I want)
    // input: 不要忽略白空格 (for instance)
    // output: endless loop (garbled text)
    while (wscanf(L"%7[^\n]ls", str) != EOF)  {
        if (ferror(stdin) && errno == EILSEQ)  {
            fprintf(stderr, "Encountered an invalid wide character.\n");
            exit(EXIT_FAILURE);
        }
        wprintf(L"%ls", str);
    }
}

What is the `ls` doing in `"%7[^\n]ls"`? Try `while (wscanf(L"%7[^\n]", str) == 1)` and then you need some code to consume the `'\n'`. — chux - Reinstate Monica, Aug 29 '17 at 17:29
@chux That all format specifiers have the same meaning as in scanf; therefore, %lc shall be used to read a wide character (and not %c), as well as %ls shall be used for wide strings (and not %s). Source: [C++ Ref](http://www.cplusplus.com/reference/cwchar/wscanf/) — Kevin Dong, Aug 29 '17 at 17:32
The reference does not suggest appending `ls` after `"%[something]"`. `"%[...]"` and `"%s"` are different specifiers. — chux - Reinstate Monica, Aug 29 '17 at 17:34
@chux `[^characters]` is called a specifier, and according to that webpage, putting `[^\n]` before `ls` shall have the same meaning as in `scanf`. `scanf("%[^\n]s", str);` is a well-known solution for preserving whitespace, so `wscanf` should also work as expected. — Kevin Dong, Aug 29 '17 at 17:36
Try `while (wscanf(L"%7[^\n]", str) != EOF) { wscanf(L"%*1[\n]"); if (ferror(stdin) ...` — chux - Reinstate Monica, Aug 29 '17 at 17:39
`scanf("%[^\n]s", str);` is a well-known and _incorrect_ solution. The `s` is not needed, a buffer overrun is possible, and `\n` is never consumed, which leads to problems here. — chux - Reinstate Monica, Aug 29 '17 at 17:40
@chux That doesn't work. `scanf("%*[^\n]s", len - 1, str);` is better to avoid memory corruption. You can specify the maximum length that `scanf` is able to read. — Kevin Dong, Aug 29 '17 at 17:42
`scanf("%*[^\n]s", len - 1, str);` confuses the `printf` and `scanf` specifiers. With `scanf()` , `*` directs to not save. The C spec calls it the "assignment-suppressing character" — chux - Reinstate Monica, Aug 29 '17 at 17:43
@chux The problem is that `wscanf` is one of the proper ways to read wide characters. `fread` is safer, but I don't want to cope with the Unicode format. Do you have any proper way to read wide characters and also not ignore whitespaces? — Kevin Dong, Aug 29 '17 at 17:46
Yes [this is close](https://stackoverflow.com/questions/45944875/dont-ignore-whitespaces-when-using-wscanf-for-utf-8?noredirect=1#comment78848824_45944875) as long as no lines are blank. Or better, use `fgetws()`. — chux - Reinstate Monica, Aug 29 '17 at 17:49
@chux The [result](http://imgur.com/a/jRG1g) is garbled after applying your suggestions even if no lines are blank. — Kevin Dong, Aug 29 '17 at 17:54
@KevinDong - do notice that `fgetws` might **not** use UTF-8 (on linux it often does use UTF-8, see the [`LC_CTYPE`](https://stackoverflow.com/questions/30479607/explain-the-effects-of-export-lang-lc-ctype-lc-all) environment variable). On the other hand, UTF-8 characters use variable length (between 1 and 4 bytes) and white space is encoded the same as ASCII (1 byte, same ASCII value), so you should be able to safely ignore encoding when processing for white space. — Myst, Aug 29 '17 at 18:40

chux - Reinstate Monica · Accepted Answer · 2017-08-29T18:47:53.413

Don't ignore whitespaces ...
... trying to read wide characters into an array of wchar_t

To read a line of text (all characters, including white-space, up to '\n') into a wide character string, use fgetws():

#define STR_SIZE 8
wchar_t str[STR_SIZE];

while (fgetws(str, STR_SIZE, stdin)) {
  // lop off the potential \n if desired
  size_t len = wcslen(str);
  if (len > 0 && str[len-1] == L'\n') {
    str[--len] = L'\0';
  }
  ...
}
