
I was just playing with characters using a very simple C++ program; let me explain the situation:

#include <iostream>

int main() {

    char c;
    std::cin >> c;
    std::cout << "The integer value of character entered is : " << int(c) << '\n';

    int m = 12 + 'á';

    std::cout << m << '\n';

    return 0;
}

Now when I execute the above program, I enter the value of c as 'á', which is in the Spanish character set and is typed as "Alt + 160" on Windows. Because my computer implements plain char as a signed char, the program outputs the integer value of 'á' as -96. But a strange thing happens when I output the value of m: it prints -19 instead of -84. However, if I execute the following program:

#include <iostream>

int main() {

    signed char c;
    std::cin >> c;
    std::cout << "The integer value of character entered is : " << int(c) << '\n';

    int m = 12 + c;

    std::cout << m << "\n";

    return 0;
}

I get the output value I expect. Now I am confused as to why this is happening: if every character is backed by some number in the computer, then why is the expression m = 12 + 'á' not evaluated as m = 12 + (-96)? Kindly enlighten me regarding this issue. I am using Windows 7 and Dev-C++.

AnkitSablok
  • On many systems the `á` is represented in [UTF-8](http://en.wikipedia.org/wiki/UTF-8) by *two* bytes.... – Basile Starynkevitch May 05 '13 at 10:48
  • what do you get with `cout << (int)('á');`? – Arne Mertz May 05 '13 at 10:48
  • @ArneMertz: I get -31 as the value – AnkitSablok May 05 '13 at 10:53
  • @BasileStarynkevitch: I think that, irrespective of the character set, the value of int(c) and int('á') must be the same if c and 'á' are the same character, isn't it? – AnkitSablok May 05 '13 at 10:58
  • @BasileStarynkevitch: On Windows, console mode does not support wide characters, it uses an 8 bit character set and code pages. – Clifford May 05 '13 at 11:23
  • I have just said that 160 is not a code for á... well I was wrong, it is, in code page CP437 aka DOS. In the Windows (CP1252) code page, á is 225, but apparently Windows does not use the Windows code page in the console. Your editor may or may not use CP437, CP1252, UTF-8, or anything else. Look at your program in a hex editor to be certain. Better yet, never use anything but plain 7-bit ASCII in your program text, especially on Windows. These things will come and bite you. When you work with text in your program, always make sure a correct encoding is used. This is NOT simple. – n. m. could be an AI May 05 '13 at 11:35
  • @n.m. make that an answer. Console -> CP437 -> 160/-96. Source code -> CP1252 -> 225/-31 – Arne Mertz May 05 '13 at 12:11
  • @Clifford: UTF-8 encoding is *not a wide* `wchar_t` encoding. It may use several consecutive bytes (i.e.`char`) to encode one single glyph. – Basile Starynkevitch May 05 '13 at 12:12
  • @Coder: no, in terms of standard C++, `á` is not a valid char. char values go from 0 to 127 and cover ASCII only. That you can use `á` in your source files is an implementation specific addition of your compiler. How the input `á` from console is interpreted, is *another* implementation and OS specific thing. That the two won't match is due to the different character encodings used. Welcome to one of the "dirty" corners of computing. – Arne Mertz May 05 '13 at 12:16
  • @all: about 'á' not being a valid C++ character: I guess the range of char being -127 to 127 or 0 to 255, i.e. signed or unsigned, is highly implementation dependent, and on my machine char is implemented as a signed char, so I don't think that should cause such an issue. If the integer value of the character 'á' has to be -31, then it shouldn't vary even when it is stored in a variable, or should it? – AnkitSablok May 05 '13 at 13:28
  • @Coder: the *source* character set and the *execution* character set need not be the same (and at runtime, *valid characters* that can be communicated with I/O routines are not necessarily the same as *possible `char` data type values*, but that's a different can of worms). – n. m. could be an AI May 05 '13 at 14:08
  • @BasileStarynkevitch: Sure, I was using the term "wide character" in a more general sense to mean not a fixed 8-bit per character encoding. The fact remains that Windows console mode does not support UTF-8 or indeed any Unicode encoding. – Clifford May 05 '13 at 20:00
  • It does support UTF-16 through `WriteConsoleW`. – dan04 May 07 '13 at 02:29

1 Answer


I have just said that 160 is not a code for á... well, I was wrong, it is: in code page CP437, a.k.a. the DOS code page. In the Windows code page (CP1252), á is 225, but apparently Windows does not use the Windows code page in the console.
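To make the difference concrete, here is a minimal sketch (an illustration added here, not part of the original programs) that hard-codes the two byte values, so it behaves the same regardless of how the file itself is saved. It assumes an 8-bit signed plain char on a two's-complement target. The console hands the program the CP437 byte, while the literal in the source carries the CP1252 byte, which is why the question sees -96 from std::cin but -19 from 12 + 'á':

#include <iostream>

int main() {
    // 'á' is byte 160 in CP437 (the console's OEM code page) and byte 225 in
    // CP1252 (the usual Windows "ANSI" code page). Hard-coding the bytes keeps
    // this sketch independent of the source file's own encoding.
    signed char from_console = static_cast<signed char>(160); // CP437 byte -> -96
    signed char from_source  = static_cast<signed char>(225); // CP1252 byte -> -31

    std::cout << "console byte: " << int(from_console) << '\n';        // -96
    std::cout << "source byte:  " << int(from_source)  << '\n';        // -31
    std::cout << "12 + console byte: " << 12 + from_console << '\n';   // -84
    std::cout << "12 + source byte:  " << 12 + from_source  << '\n';   // -19

    return 0;
}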

Your editor may or may not use CP437, CP1252, UTF-8, or anything else. Look at your program in a hex editor to be certain. Better yet, never use anything but plain 7-bit ASCII in your program text, especially on Windows, but in general everywhere else too. These things are not portable even between different computers running the same version of the same OS, and the standard gives no guarantees about them. They will come back to bite you. If you need non-ASCII character strings in your program, read them from a data file; never embed them in the source.
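As a quick alternative to opening the file in a hex editor, a small sketch (added here for illustration, not part of the original answer) can dump the bytes the compiler actually stored for the literal. What it prints depends on the source encoding and on the compiler's execution character set:

#include <iostream>
#include <iomanip>

int main() {
    // Print the raw bytes of the literal: CP1252 typically shows "e1",
    // UTF-8 shows "c3 a1", and CP437 shows "a0".
    const char text[] = "á";
    for (const char* p = text; *p != '\0'; ++p)
        std::cout << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<int>(static_cast<unsigned char>(*p)) << ' ';
    std::cout << '\n';
    return 0;
}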

When you work with text in your program, always make sure a correct encoding is used. This is NOT simple, especially under Windows when using Visual Studio and the standard C and/or C++ I/O libraries; I was not able to make this combination work with UTF-8.
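One possible workaround (a sketch added here, not something the original answer tested) is to tell the console to use the same code page as the source file via the Win32 SetConsoleCP/SetConsoleOutputCP calls. This assumes Windows and a CP1252-encoded source file, and whether Alt-code keyboard entry follows the new code page is yet another implementation detail:

#include <windows.h>
#include <iostream>

int main() {
    // Assumption: the source file is saved as CP1252. Switch the console away
    // from its default OEM code page (CP437) so that console input and the
    // byte the compiler embedded in 'á' agree.
    SetConsoleCP(1252);        // code page used for console input
    SetConsoleOutputCP(1252);  // code page used for console output

    char c;
    std::cin >> c;                                   // typing á should now yield byte 225 (-31 signed)
    std::cout << int(c) << '\n';                     // -31, same as int('á') in the source
    std::cout << 12 + c << " " << 12 + 'á' << '\n';  // both -19 once the encodings match
    return 0;
}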

n. m. could be an AI