I am trying to read wide characters from stdin into an array of wchar_t. However, the negated scanset specifier ([^characters]) combined with ls does not work as expected.

My goal is to have every whitespace character read into str instead of being skipped. I therefore tried [^\n], but without success: the program loops and keeps printing garbled text to stdout.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>

int main(void)
{
    wchar_t str[8];

    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)  {
        fprintf(stderr, "Failed to set locale LC_ALL = en_US.UTF-8.\n");
        exit(EXIT_FAILURE);
    }

    // correct (but not what I want)
    // whitespaces and EOLs are ignored
    // while (wscanf(L"%7ls", str) != EOF)  {
    //     wprintf(L"%ls", str);
    // }

    // incorrect
    // whitespaces (except EOLs) are properly read into str (what I want)
    // input: 不要忽略白空格 (for instance)
    // output: endless loop (garbled text)
    while (wscanf(L"%7[^\n]ls", str) != EOF)  {
        if (ferror(stdin) && errno == EILSEQ)  {
            fprintf(stderr, "Encountered an invalid wide character.\n");
            exit(EXIT_FAILURE);
        }
        wprintf(L"%ls", str);
    }
}
Kevin Dong
  • What is the `ls` doing in `"%7[^\n]ls"`? Try `while (wscanf(L"%7[^\n]", str) == 1)` and then you need some code to consume the `'\n'`. – chux - Reinstate Monica Aug 29 '17 at 17:29
  • @chux That all format specifiers have the same meaning as in scanf; therefore, %lc shall be used to read a wide character (and not %c), as well as %ls shall be used for wide strings (and not %s). Source: [C++ Ref](http://www.cplusplus.com/reference/cwchar/wscanf/) – Kevin Dong Aug 29 '17 at 17:32
  • The reference does not suggest appending `ls` after `"%[something]"` `"%[...]"` and `"%s"` are different specifiers. – chux - Reinstate Monica Aug 29 '17 at 17:34
  • @chux `[^characters]` is called a specifier, and according to that webpage, putting `[^\n]` before `ls` shall have the same meaning as in `scanf`. `scanf("%[^\n]s", str);` is a well-known solution for preserving whitespace, so `wscanf` should also work as expected. – Kevin Dong Aug 29 '17 at 17:36
  • Try `while (wscanf(L"%7[^\n]", str) != EOF) { wscanf(L"%*1[\n]"); if (ferror(stdin) ...` – chux - Reinstate Monica Aug 29 '17 at 17:39
  • `scanf("%[^\n]s", str);` is a well known and _incorrect_ solution. The `s` is not needed, the buffer can overrun, and `\n` is never consumed, which leads to problems here. – chux - Reinstate Monica Aug 29 '17 at 17:40
  • @chux That doesn't work. `scanf("%*[^\n]s", len - 1, str);` is better to avoid memory corruption. You can specify the maximum length that `scanf` is able to read. – Kevin Dong Aug 29 '17 at 17:42
  • `scanf("%*[^\n]s", len - 1, str);` confuses the `printf` and `scanf` specifiers. With `scanf()` , `*` directs to not save. The C spec calls it the "assignment-suppressing character" – chux - Reinstate Monica Aug 29 '17 at 17:43
  • @chux The problem is that `wscanf` is one of the proper ways to read wide characters. `fread` is safer, but I don't want to deal with the Unicode encoding myself. Do you know a proper way to read wide characters without ignoring whitespace? – Kevin Dong Aug 29 '17 at 17:46
  • Yes [this is close](https://stackoverflow.com/questions/45944875/dont-ignore-whitespaces-when-using-wscanf-for-utf-8?noredirect=1#comment78848824_45944875) as long as no lines are blank. Or better, use `fgetws()`. – chux - Reinstate Monica Aug 29 '17 at 17:49
  • @chux The [result](http://imgur.com/a/jRG1g) is garbled after applying your suggestions even if no lines are blank. – Kevin Dong Aug 29 '17 at 17:54
  • @chux Great. `fgetws` works. Thanks a lot. – Kevin Dong Aug 29 '17 at 18:02
  • @KevinDong - do notice that `fgetws` might **not** use UTF-8 (on linux it often does use UTF-8, see the [`LC_CTYPE`](https://stackoverflow.com/questions/30479607/explain-the-effects-of-export-lang-lc-ctype-lc-all) environment variable). On the other hand, UTF-8 characters use variable length (between 1 and 4 bytes) and white space is encoded the same as ASCII (1 byte, same ASCII value), so you should be able to safely ignore encoding when processing for white space. – Myst Aug 29 '17 at 18:40

1 Answer

Don't ignore whitespaces ...
... trying to read wide characters into an array of wchar_t

To read a line of text (all characters, including whitespace, up to and including the '\n') into a wide-character string, use fgetws():

#define STR_SIZE 8
wchar_t str[STR_SIZE];

while (fgetws(str, STR_SIZE, stdin)) {
  // lop off the potential \n if desired
  size_t len = wcslen(str);
  if (len > 0 && str[len-1] == L'\n') {
    str[--len] = L'\0';
  }
  ...
}
chux - Reinstate Monica