
I'm simply trying to detect non-ASCII characters in my C++ program on Windows. Using something like isascii() or a check like:

bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch));

does not work because non-ascii characters are getting mapped to ascii characters before or while getchar() is doing its thing. For example, if I have some code like:

#include <cctype>   // isascii
#include <cstdio>   // getchar, printf
#include <iostream>
using namespace std;
int main()
{
    int c;
    c = getchar();
    cout << isascii(c) << endl;
    cout << c << endl;
    printf("0x%x\n", c);
    cout << (char)c;
    return 0;
}

and input an emoji (because I am so happy right now), the output is

1
63
0x3f
?

Furthermore, if I feed the program a character outside the extended ASCII range (code page 437), like 'Ĥ', I get the output

1
72
0x48
H

This works with similar inputs such as Ĭ or ō (they go to I and o), so this seems algorithmic and not just mojibake. A quick check in Python (via the same terminal) with a program like

i = input()
print(ord(i))

gives me the expected code point instead of the ASCII-mapped one (so it's not the code page or the terminal?). This makes me believe getchar() or the C++ compilers (tested on the VS compiler and g++) are doing something funky. I have also tried using cin and many other alternatives. Note that I've tried this on Linux and I cannot reproduce the issue, which makes me inclined to believe it is something to do with Windows (10 Pro). Can anyone explain what is going on here?

Dakoteus

2 Answers


Try replacing getchar() with getwchar(); I think you're right that it's a Windows-only problem.

I think the problem is that getchar() expects input as a char type, which is 8 bits and only supports ASCII. getwchar() supports the wchar_t type, which allows for other text encodings. The emoji isn't ASCII, and from this page: https://learn.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings , it seems like Windows encodes extended characters like this in UTF-16. I was having trouble finding a lookup table for UTF-16 emoji, but I'm guessing that one of the bytes in the UTF-16 encoding of the emoji is 0x3f, which is why you're seeing that printed out.

Gandhi
    `getchar` is more than just ASCII, it works with full 8-bit bytes although some may be expanded to negative numbers. – Mark Ransom Jun 10 '21 at 19:46
  • Hi, thanks for the quick response. Unfortunately switching to wchar_t and `getwchar();` doesn't solve the issue. I get the same outputs. I've tried it previously and just retried it again. It's not that wchar_t isn't wide enough either because even if I go just one over extended ascii (256), I still get a 'converted' ascii character. That is, Ā -> A. – Dakoteus Jun 10 '21 at 19:57

Okay, I have solved this. I was not aware of translation modes.

_setmode(_fileno(stdin), _O_WTEXT);

was the solution. The page below explains that there are translation modes, and I think phase 5 (character-set mapping) explains what happened: https://en.cppreference.com/w/cpp/language/translation_phases

Dakoteus