
I have code that does the following:

unsigned char charStr; // this variable can only take the value 0, 1, or 2
WCHAR wcharStr;
...
charStr = wcharStr - '0';
...

I am aware that you might lose some data (going from 16 bits to 8 bits) when converting from Unicode (wchar_t) to ANSI (unsigned char). However, can someone explain why subtracting '0' makes this conversion right?

ekremer

1 Answer


The C and C++ language standards require that the encodings of the digit characters '0' through '9' be consecutive. Therefore '4' - '0', for example, gives you 4.
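For example, a minimal, self-contained check of that guarantee (the wide-character assertion additionally assumes an ASCII/Unicode wchar_t, as discussed below):

#include <cassert>

int main() {
    // '0'..'9' are guaranteed to be consecutive, so subtracting '0' yields the digit's value.
    assert('4' - '0' == 4);
    // The same holds for wide literals on compilers that map wchar_t to Unicode.
    assert(L'7' - L'0' == 7);
    return 0;
}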

This is not actually required for wchar_t, but in the real world your compiler will map it to Unicode: UTF-16 on Windows, UCS-4 elsewhere. The first 128 code points of Unicode are the same as ASCII. Since you're not compiling this code on the one modern, real-world compiler family that uses a non-ASCII character set (IBM's Z-series mainframes, which default to Code Page 1047 for backward compatibility), your compiler promotes both the wchar_t and the char to some integral type, probably 32 bits wide, subtracts, and gets the digit's numeric value. It then stores that value in a variable of type unsigned char, which is a mistake if you ever treat it as a character: 0, 1 and 2 are the ASCII codes of unprintable control characters, not the digits '0', '1' and '2'.
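Here is a sketch of what the question's snippet actually does, using a value I chose for illustration:

#include <cctype>
#include <cstdio>

int main() {
    wchar_t wcharStr = L'2';                  // code point U+0032, numeric value 0x32
    // Both operands are promoted to an integral type before the subtraction.
    unsigned char charStr = wcharStr - '0';   // 0x32 - 0x30 == 2, which fits in unsigned char
    std::printf("numeric value: %d\n", charStr);                                    // prints 2
    std::printf("printable character? %s\n", std::isprint(charStr) ? "yes" : "no"); // no: 2 is a control code
    return 0;
}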

This code is not correct. If you want to convert from wchar_t to char, you should use either codecvt from the STL or wcrtomb() from the C standard library. There is also wctob(), which converts a wide character to a single byte if and only if that is possible. Set your locale before you use any of them.
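A minimal sketch of the locale-aware route with wcrtomb() and wctob(); the empty locale string simply picks up the locale from the environment:

#include <climits>   // MB_LEN_MAX
#include <clocale>
#include <cstdio>
#include <cwchar>

int main() {
    std::setlocale(LC_ALL, "");   // these conversions are locale-dependent, so set the locale first

    wchar_t wc = L'2';
    char buf[MB_LEN_MAX];
    std::mbstate_t state{};
    std::size_t len = std::wcrtomb(buf, wc, &state);
    if (len != static_cast<std::size_t>(-1))
        std::printf("wcrtomb: %zu byte(s), first byte '%c'\n", len, buf[0]);

    int b = std::wctob(L'2');     // a single byte, or EOF if no single-byte form exists
    if (b != EOF)
        std::printf("wctob: '%c'\n", b);
    return 0;
}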

If you’re sure that your wchar_t holds Unicode, that your unsigned char holds Latin-1, and that your values are within range, however, you can simply cast the wchar_t value to (unsigned char). Another approach, if you know you have a digit, is to write (wcharStr - L'0') + '0'.
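A sketch of both shortcuts under those assumptions, reusing the question's variable name and a digit I picked for illustration:

#include <cassert>

int main() {
    wchar_t wcharStr = L'2';   // assumed: a Unicode value within the Latin-1 range

    // Plain cast: correct whenever the code point fits in 0x00..0xFF (Latin-1).
    unsigned char asLatin1 = (unsigned char)wcharStr;
    assert(asLatin1 == '2');

    // Digit round-trip: subtract the wide '0', add back the narrow '0'.
    unsigned char charStr = (wcharStr - L'0') + '0';
    assert(charStr == '2');
    return 0;
}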

Davislor
  • Funny thing is that my man page for `wctob` says: *Never use this function. It cannot help you in writing internationalized programs.* – Pablo Mar 05 '18 at 02:24
  • @Pablo The justification for that on Linux is, "Internationalized programs must never distinguish single-byte and multibyte characters." So, this code is already breaking that advice anyway. But convert to a multi-byte string for better portability. – Davislor Mar 05 '18 at 03:16
  • To make life extra fun, IBM’s Z-series mainframes also use a 2-byte wchar_t. – Peeter Joot Mar 20 '20 at 16:47