
I have code that does the following:

unsigned char charStr; // this variable can only take the value 0, 1, or 2
WCHAR wcharStr;
...
charStr = wcharStr - '0';
...

I am aware that you might lose some data (going from 16 bits to 8 bits) when converting from Unicode (wchar_t) to ANSI (unsigned char). However, can someone explain why subtracting '0' makes this conversion right?

ekremer

1 Answer


The C and C++ language standards require that the encodings of the digit characters '0' through '9' be consecutive. Therefore '4' - '0', for example, gives you 4.
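For example, a minimal, self-contained check of that guarantee (the wide-character assertion additionally assumes an ASCII/Unicode wchar_t, as discussed below):

#include <cassert>

int main() {
    // '0'..'9' are guaranteed to be consecutive, so subtracting '0' yields the digit's value.
    assert('4' - '0' == 4);
    // The same holds for wide literals on compilers that map wchar_t to Unicode.
    assert(L'7' - L'0' == 7);
    return 0;
}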

This is not actually required for wchar_t, but in the real world your compiler will map it to Unicode: UTF-16 on Windows, UCS-4 elsewhere. The first 128 code points of Unicode are the same as ASCII. Since you're not compiling this code on the one modern, real-world compiler family that uses a non-ASCII character set (IBM's Z-series mainframes, which default to Code Page 1047 for backward compatibility), your compiler promotes both the wchar_t and the char to some integral type, probably 32 bits wide, subtracts, and gets the digit's numeric value. It then stores that value in a variable of type unsigned char, which is a mistake if you ever treat it as a character: 0, 1 and 2 are the ASCII codes of unprintable control characters, not the digits '0', '1' and '2'.
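Here is a sketch of what the question's snippet actually does, using a value I chose for illustration:

#include <cctype>
#include <cstdio>

int main() {
    wchar_t wcharStr = L'2';                  // code point U+0032, numeric value 0x32
    // Both operands are promoted to an integral type before the subtraction.
    unsigned char charStr = wcharStr - '0';   // 0x32 - 0x30 == 2, which fits in unsigned char
    std::printf("numeric value: %d\n", charStr);                                    // prints 2
    std::printf("printable character? %s\n", std::isprint(charStr) ? "yes" : "no"); // no: 2 is a control code
    return 0;
}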

This code is not correct. If you want to convert from wchar_t to char, you should use either codecvt from the STL or wcrtomb() from the C standard library. There is also wctob(), which converts a wide character to a single byte if and only if that is possible. Set your locale before you use any of them.
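A minimal sketch of the locale-aware route with wcrtomb() and wctob(); the empty locale string simply picks up the locale from the environment:

#include <climits>   // MB_LEN_MAX
#include <clocale>
#include <cstdio>
#include <cwchar>

int main() {
    std::setlocale(LC_ALL, "");   // these conversions are locale-dependent, so set the locale first

    wchar_t wc = L'2';
    char buf[MB_LEN_MAX];
    std::mbstate_t state{};
    std::size_t len = std::wcrtomb(buf, wc, &state);
    if (len != static_cast<std::size_t>(-1))
        std::printf("wcrtomb: %zu byte(s), first byte '%c'\n", len, buf[0]);

    int b = std::wctob(L'2');     // a single byte, or EOF if no single-byte form exists
    if (b != EOF)
        std::printf("wctob: '%c'\n", b);
    return 0;
}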

If you’re sure that your wchar_t holds Unicode, that your unsigned char holds Latin-1, and that your values are within range, however, you can simply cast the wchar_t value to (unsigned char). Another approach, if you know you have a digit, is to write (wcharStr - L'0') + '0'.
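A sketch of both shortcuts under those assumptions, reusing the question's variable name and a digit I picked for illustration:

#include <cassert>

int main() {
    wchar_t wcharStr = L'2';   // assumed: a Unicode value within the Latin-1 range

    // Plain cast: correct whenever the code point fits in 0x00..0xFF (Latin-1).
    unsigned char asLatin1 = (unsigned char)wcharStr;
    assert(asLatin1 == '2');

    // Digit round-trip: subtract the wide '0', add back the narrow '0'.
    unsigned char charStr = (wcharStr - L'0') + '0';
    assert(charStr == '2');
    return 0;
}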

Davislor
  • Funny thing is that my man page for `wctob` says: *Never use this function. It cannot help you in writing internationalized programs.* – Pablo Mar 05 '18 at 02:24
  • @Pablo The justification for that on Linux is, "Internationalized programs must never distinguish single-byte and multibyte characters." So, this code is already breaking that advice anyway. But convert to a multi-byte string for better portability. – Davislor Mar 05 '18 at 03:16
  • To make life extra fun, IBM’s Z-series mainframes also use a 2-byte wchar_t. – Peeter Joot Mar 20 '20 at 16:47