
I'm simply trying to detect non-ASCII characters in my C++ program on Windows. Using something like isascii() or a check like:

bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch));

does not work because non-ascii characters are getting mapped to ascii characters before or while getchar() is doing its thing. For example, if I have some code like:

#include <cctype>   // isascii
#include <cstdio>   // getchar, printf
#include <iostream>
using namespace std;
int main()
{
    int c;
    c = getchar();
    cout << isascii(c) << endl;
    cout << c << endl;
    printf("0x%x\n", c);
    cout << (char)c;
    return 0;
}

and input an emoji (because I am so happy right now), the output is

1
63
0x3f
?

Furthermore, if I feed the program a character outside the extended ASCII range (code page 437), like 'Ĥ', I get the output

1
72
0x48
H

This works with similar inputs such as Ĭ or ō (they go to I and o), so this seems algorithmic and not just mojibake. A quick check in Python (via the same terminal) with a program like

i = input()
print(ord(i))

gives me the expected code point instead of the ASCII-mapped one (so it's not the code page or the terminal?). This makes me believe getchar() or the C++ compilers (tested on the VS compiler and g++) are doing something funky. I have also tried using cin and many other alternatives. Note that I've tried this on Linux and I cannot reproduce the issue, which makes me inclined to believe it is something to do with Windows (10 Pro). Can anyone explain what is going on here?

Dakoteus

2 Answers


Try replacing getchar() with getwchar(); I think you're right that it's a Windows-only problem.

I think the problem is that getchar() expects input as a char type, which is 8 bits and only supports ASCII. getwchar() supports the wchar_t type, which allows for other text encodings. The emoji isn't ASCII, and from this page: https://learn.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings , it seems like Windows encodes extended characters like this in UTF-16. I was having trouble finding a lookup table for UTF-16 emoji, but I'm guessing that one of the bytes in the UTF-16 encoding of the emoji is 0x3f, which is why you're seeing that printed out.

Gandhi
    `getchar` is more than just ASCII, it works with full 8-bit bytes although some may be expanded to negative numbers. – Mark Ransom Jun 10 '21 at 19:46
  • Hi, thanks for the quick response. Unfortunately switching to wchar_t and `getwchar();` doesn't solve the issue. I get the same outputs. I've tried it previously and just retried it again. It's not that wchar_t isn't wide enough either because even if I go just one over extended ascii (256), I still get a 'converted' ascii character. That is, Ā -> A. – Dakoteus Jun 10 '21 at 19:57

Okay, I have solved this. I was not aware of translation modes.

_setmode(_fileno(stdin), _O_WTEXT);

was the solution. The page below explains that there are translation modes, and I think phase 5 (character-set mapping) explains what happened: https://en.cppreference.com/w/cpp/language/translation_phases

Dakoteus