
This is a follow-up to this question: Is std::string supposed to have only ASCII characters?

I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.

I am handling the input assuming it is ASCII. For example, I am using something like `static_cast<unsigned int>(my_char - '0')` to get the digit as an unsigned int.
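For instance, something along these lines (just a minimal sketch of what I mean; the names are mine and only for illustration):

```cpp
#include <iostream>
#include <string>

int main() {
    std::string input;
    std::cin >> input;                      // e.g. "7a3"

    for (char my_char : input) {
        if (my_char >= '0' && my_char <= '9') {
            // Here I am assuming ASCII: digit characters map directly to their values.
            unsigned int value = static_cast<unsigned int>(my_char - '0');
            std::cout << "digit: " << value << '\n';
        }
    }
}
```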

How can I make this code cross-platform? How can I tell that I want the input to always be ASCII? Or have I missed a lot of concepts, and is `static_cast<unsigned int>(my_char - '0')` just a bad approach?

P.S. In ASCII (at least) the digits are in sequential order. However, I do not know whether they are in other encodings. (I am pretty sure they are, but there is no guarantee, right?)

Humam Helfawi
  • [FYI] `static_cast<unsigned int>(my_char - '0')` is guaranteed to work in all character sets C++ uses. – NathanOliver Jun 21 '16 at 20:19
  • @NathanOliver mmm I suspected that.. However, it was just an example.. I will add another one. Thanks – Humam Helfawi Jun 21 '16 at 20:20
  • @NathanOliver: But not in all charsets that the user may enter. In MOST charsets, the ASCII range of characters is the same. But that is not true in ALL charsets. For example, EBCDIC does not use the same `char` values for ASCII numbers (`'0'` is 0x30 in ASCII, but is 0xF0 in EBCDIC), and EBCDIC does not use sequential ranges for all ASCII letters. So, you have to take the input charset into account when processing it. `std::string` only knows about `char` values, but not what they represent. – Remy Lebeau Jun 21 '16 at 20:24
  • You could make that a requirement of the terminal used... Not sure how far you would get in the long run. Better to validate the user's input, throw on an error (or remove the invalid input), and print a nice usage message. – Niall Jun 21 '16 at 20:27
  • @RemyLebeau the standard states that 0-9 must be contiguous in all compliant character sets. – NathanOliver Jun 21 '16 at 21:32
  • @Nathan _"the standard states that 0-9 must be contiguous"_ Do you have the cite at hand, and would you like to improve my answer with it? Otherwise we should have a dupe, or you could answer yourself including that information. – πάντα ῥεῖ Jun 21 '16 at 22:06
  • @NathanOliver: what the C++ standard dictates about charsets used at compile-time is irrelevant when dealing with charsets used for **user/data input** at runtime, which is the issue of this discussion. – Remy Lebeau Jun 21 '16 at 22:35
  • @RemyLebeau I have edited πάντα ῥεῖ's answer to include the cite from the standard. C++ requires this to hold true on both the source and execution character sets. That will cover run time. – NathanOliver Jun 21 '16 at 23:50
  • @NathanOliver: No, that doesn't cover runtime data. The *value* of each `char` in a `std::string` can represent anything as long as it fits within the bounds of the `char` type. There is nothing the compiler can do to enforce charset restrictions at runtime. There is nothing to stop runtime data from using 0xF0 instead of 0x30 to represent `'0'` if that is what the data wants to use. That is what charsets are all about - different representations of character data. The charset that the compiler wants to use, and the charset that the runtime data wants to use, are completely separate things. – Remy Lebeau Jun 22 '16 at 00:09
  • @NathanOliver: `my_char` holds runtime data. `my_char - '0'` is only guaranteed to produce values `0-9` when `my_char` holds a value based on a runtime charset that is compatible with the charset the compiler uses to encode characters `'0'`-`'9'` at compile-time. That is *usually* the case, as MOST charsets share the same ASCII range for compatibility, but that is not 100% guaranteed of ALL charsets in existance. So you have to pay attention to the charset of the data at runtime, you *might* have to perform a data conversion before applying the subtraction. – Remy Lebeau Jun 22 '16 at 00:11
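A small sketch of the situation described in the comments above (the EBCDIC branch is an illustrative assumption, not something the compiler or standard library does for you): if the runtime data uses a charset that differs from the execution character set, the bytes have to be mapped before the subtraction trick works.

```cpp
#include <stdexcept>

// Hypothetical helper: get a digit's numeric value, taking the input charset
// into account. EBCDIC encodes '0'..'9' as the contiguous range 0xF0..0xF9.
unsigned int digit_value(unsigned char byte, bool input_is_ebcdic) {
    if (input_is_ebcdic) {
        if (byte < 0xF0 || byte > 0xF9)
            throw std::invalid_argument("not an EBCDIC digit");
        byte = static_cast<unsigned char>('0' + (byte - 0xF0));  // map into the execution charset
    }
    return static_cast<unsigned int>(byte - '0');
}
```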

2 Answers


How can I force the user/OS to input an ASCII string?

You cannot, unless you let the user specify the numeric values of such ASCII input.

It all depends on how the terminal implementation used to serve `std::cin` translates keystrokes like 0 into a specific number, and whether your toolchain expects that number to match its intrinsic translation of `'0'`.

You simply shouldn't expect explicit ASCII values (e.g. magic numbers), but should use char literals to get portable code. The assumption that `my_char - '0'` results in the actual digit's value is true for all character sets. The C++ standard states in [lex.charset]/3 that

The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, **the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.** [...]

emphasis mine
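So, a sketch of the portable way to get a digit's value (assuming the input actually arrives in the execution character set; the helper's name is just for illustration):

```cpp
// Returns true and sets 'value' if my_char is a decimal digit; '0'..'9' are
// guaranteed to be contiguous in the execution character set, so both the
// comparison and the subtraction are portable.
bool to_digit(char my_char, unsigned int& value) {
    if (my_char >= '0' && my_char <= '9') {
        value = static_cast<unsigned int>(my_char - '0');
        return true;
    }
    return false;
}
```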

πάντα ῥεῖ

You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string also happens to be ASCII-encoded.

Also, whatever platform-specific measures you take, the user can pipe in a UTF-16LE encoded file. Or /dev/urandom.

Your mistake is treating string encoding as some magic property of an input stream - it is not. It is, well, an encoding, like JPEG or AVI, and must be handled exactly the same way: read the input, match it against the format, and report errors on parsing failure.

For your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte whose value is outside the ASCII range.
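For example, a sketch of such a validating read (the function name and the choice to throw are just one possibility):

```cpp
#include <istream>
#include <stdexcept>
#include <string>

// Read the whole stream and reject any byte outside the ASCII range (0x00..0x7F).
std::string read_ascii(std::istream& in) {
    std::string result;
    char c;
    while (in.get(c)) {
        if (static_cast<unsigned char>(c) > 0x7F)
            throw std::runtime_error("input is not ASCII");
        result += c;
    }
    return result;
}
```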

However, if you later encounter a terminal providing data in some incompatible encoding, like UTF-16LE, you have no choice but to write detection (based on the byte order mark) and a conversion routine.
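E.g. a minimal BOM check (only the UTF-16 BOMs are handled here; the actual conversion routine is a much bigger job and is left out):

```cpp
#include <string>

enum class Bom { None, Utf16LE, Utf16BE };

// Look at the first two bytes: 0xFF 0xFE marks UTF-16LE, 0xFE 0xFF marks UTF-16BE.
// Anything else is treated as a plain byte-oriented encoding.
Bom detect_bom(const std::string& bytes) {
    if (bytes.size() >= 2) {
        const unsigned char b0 = bytes[0], b1 = bytes[1];
        if (b0 == 0xFF && b1 == 0xFE) return Bom::Utf16LE;
        if (b0 == 0xFE && b1 == 0xFF) return Bom::Utf16BE;
    }
    return Bom::None;
}
```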