5

Is std::string supposed to hold a set of characters in ASCII encoding on all platforms and standard compilers?

In other words, can I be sure that my C++ program will get a set of ASCII characters if I do this:

std::string input;
std::getline(std::cin, input);

EDIT:

To put it more precisely, I want to make sure that if the user enters "a0" I will get a std::string with two elements. The first one is 97 and the second is 48.
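For example, the behaviour I am hoping for could be written down like this (a sketch only; as the answers below explain, nothing guarantees these asserts hold on every platform):

#include <cassert>
#include <iostream>
#include <string>

int main() {
    std::string input;
    std::getline(std::cin, input); // suppose the user types "a0"

    // The hoped-for result: two bytes holding the ASCII values of 'a' and '0'.
    assert(input.size() == 2);
    assert(input[0] == 97); // 'a'
    assert(input[1] == 48); // '0'
}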

Andrew Brēza
Humam Helfawi
  • There's absolutely no guarantee. UTF-8 is a very popular character encoding, and if you type "á0" on such a system your string will contain *three* elements. – Mark Ransom Jun 21 '16 at 20:10
  • @MarkRansom I see.. I will post another question about how I can force or ensure that an ASCII string is input. Thanks – Humam Helfawi Jun 21 '16 at 20:11
  • "I have a variable `std::string xml`. Does the compiler or the STL enforce that there is only XML strings inside?" - No. The type is `char' not "XML" or "Unicode". Don't confuse type, format or encoding. There is a valid question in there though: "How can I control the standard IO encoding?" – Fozi Jun 21 '16 at 20:28
  • @Fozi yes you are right.. and I have asked this http://stackoverflow.com/questions/37953843/how-can-force-the-user-os-to-input-an-ascii-string – Humam Helfawi Jun 21 '16 at 20:28
  • @HumamHelfawi: I think the correct behavior is, *validate* that the input contains only ASCII if that is your precondition, which is easy to do, and fail with a clear error message if the input doesn't meet your conditions. I don't think you can go "back" from unicode characters to ASCII representing the user's key strokes -- that would probably be extremely difficult. If you are really asking how to reconfigure the terminal so that it behaves differently, I think that will also be hard and will be platform dependent. – Chris Beck Jun 22 '16 at 00:51
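A minimal sketch of the validation Chris Beck suggests, assuming the precondition is plain 7-bit ASCII (the cast to unsigned char matters on platforms where char is signed):

#include <algorithm>
#include <iostream>
#include <string>

// True if every byte of s is in the 7-bit ASCII range 0..127.
bool is_ascii(const std::string& s) {
    return std::all_of(s.begin(), s.end(), [](char c) {
        return static_cast<unsigned char>(c) < 128;
    });
}

int main() {
    std::string input;
    std::getline(std::cin, input);
    if (!is_ascii(input))
        std::cerr << "error: input must be plain ASCII\n";
}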

3 Answers

10

No. std::string does not hold "characters"; it holds bytes.

Those bytes could form some human-readable string through an encoding such as ASCII, EBCDIC, or Unicode. They could be a binary encoding storing computer-readable information (e.g. a JPEG image). They could be guidelines from aliens on how to use Stack Overflow for three weeks straight without being downvoted even once. They could be totally random white noise.

Your program needs to understand what the data it is reading actually means, and how it is encoded. That is part of your job as the programmer.

(It's unfortunate, and nowadays misleading, that char is named char.)
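To make the "bytes, not characters" point concrete, here is a small sketch that dumps whatever bytes std::getline produced, without assuming any encoding:

#include <iostream>
#include <string>

int main() {
    std::string input;
    std::getline(std::cin, input);

    // Print the raw byte values; what they "mean" (ASCII? UTF-8? noise?)
    // is not something std::string knows or cares about.
    for (char c : input)
        std::cout << static_cast<int>(static_cast<unsigned char>(c)) << ' ';
    std::cout << '\n';
}

On a typical UTF-8 terminal, typing á0 prints 195 161 48: three bytes for what the user sees as two characters.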

Lightness Races in Orbit
  • "without being downvoted" I'm happy now that there is no guarantee. At least, I may find the guidelines someday – Humam Helfawi Jun 21 '16 at 19:56
  • @HumamHelfawi: Assuming you can write a program to decode those guidelines ;) – Lightness Races in Orbit Jun 21 '16 at 19:56
  • Are you saying a `char` is a `byte` even if `char` is a signed type? – R Sahu Jun 21 '16 at 20:17
  • `char` is short for "character" which is wholly misleading (hence this question); `byte` is literally what the type is, despite yes okay there is a variant with its own meaningful upper bit. But um let's just ignore signed chars ;) Let's say that `signed char` (and `char` on a default-signed-`char`-platform) is a `CHAR_BIT`-bit integer and be done with it ;) But never a "character". – Lightness Races in Orbit Jun 21 '16 at 22:49
  • From http://en.cppreference.com/w/c/language/type: *The type `char` is not compatible with `signed char` and not compatible with `unsigned char`.* It makes me think that a `char` is not compatible with `byte`, unless `char` and `byte` are exactly the same type. Are there any subtle points I am missing? – R Sahu Jun 23 '16 at 16:05
  • @RSahu: I do not understand your question. C++ does not actually have a type named `byte`, so there's nothing to be compatible with. The point is that the word "byte" should have been used instead of "char", because these are bytes, not characters. (Indeed, of course, in C++ conversion terms a type `byte` would be incompatible with type `signed byte` and `unsigned byte`, just like now; though that has nothing to do with the question.) – Lightness Races in Orbit Jun 23 '16 at 16:24
  • @LightnessRacesinOrbit, I think I understand what you are trying to say :) `char`, `unsigned char`, and `signed char` could easily have been replaced by `byte`, with `byte` always being an `unsigned` type. – R Sahu Jun 23 '16 at 16:29
  • @RSahu: No all I'm saying is that they should have `s/char/byte/` when naming these types. Or maybe `byte` for unsigned and `short short int` (lol) for signed. And dump the ambiguous one. – Lightness Races in Orbit Jun 23 '16 at 16:30
  • @LightnessRacesinOrbit, I disagree with you. `char` represents a different abstraction than a `byte`. `'a'` has more meaning than `97` (ASCII encoding) and `129` (EBCDIC encoding). – R Sahu Jun 23 '16 at 16:35
  • @RSahu: Yes, it does represent a different abstraction than a byte, and that's why it's wrong, because these objects _do not have any encoding_. Some object of type `char` is not "abstract" and the type `char` suggests/implies/requires no encoding. That's why they're not characters! They're just bytes, with some numerical value. Any encoding is purely application-determined. Except literals, I'll grant you. – Lightness Races in Orbit Jun 23 '16 at 16:44
  • @LightnessRacesinOrbit, true. If I saved the string `"abcd"` in a binary file in a platform that uses ASCII encoding and then read the file in a platform that uses EBCDIC encoding, I won't get back `"abcd"`. Damn! – R Sahu Jun 23 '16 at 16:47
3

No, there is no guarantee that

std::string input;
std::getline(std::cin, input);

will return only ASCII characters. The range of values that can be held by a char is not limited to the ASCII characters.

If your platform uses a different encoding than ASCII, you'll obviously get a different set of characters.

Even if your platform uses ASCII encoding, if char on the platform is an unsigned type, then it can very easily hold the extended ASCII characters too.
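A sketch of the signedness point (0xE1 is just an arbitrary byte outside the 7-bit ASCII range):

#include <iostream>

int main() {
    char c = static_cast<char>(0xE1);

    // Where char is signed this prints -31; where it is unsigned, 225.
    // The stored bit pattern is the same either way.
    std::cout << static_cast<int>(c) << '\n';

    // Recovering the unsigned byte value portably:
    std::cout << static_cast<int>(static_cast<unsigned char>(c)) << '\n'; // 225
}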

R Sahu
  • Thanks.. What can I do if I want the input to be treated as ASCII? Just a link will help if you do not mind. (I am afraid of searching for this myself because of the tons of wrong and immature content out there) – Humam Helfawi Jun 21 '16 at 19:51
  • @HumamHelfawi, are you asking how you can prevent non-ASCII characters from being read into `input`? – R Sahu Jun 21 '16 at 19:52
  • To put it more precisely, I want to make sure that if the user enters "a0" I will get a string with two elements. The first one is 97 and the second is 48 – Humam Helfawi Jun 21 '16 at 19:54
  • @HumamHelfawi, I am afraid you'll have to write code to do that if your platform uses a different character encoding. If your platform uses ASCII encoding, you will get that by default. – R Sahu Jun 21 '16 at 19:57
  • What a `std::string` can hold or `std::cin` can read need not even have anything to do with the "platform's encoding", or with ASCII, or with extended ASCII. Try piping in the result of `dd`, or a `cat someimage.jpg` and you'll see :) The correct answer is that _`std::string` has no notion of encoding at all_. And neither does `std::cin`. – Lightness Races in Orbit Jun 21 '16 at 20:01
  • It doesn't really matter if `char` is signed or unsigned, it will still hold an extended character set, more than just ASCII. – Mark Ransom Jun 21 '16 at 20:06
  • @MarkRansom, a `char` cannot hold a character from the extended ASCII character set if it is a signed type. Enlighten me if I am mistaken. – R Sahu Jun 21 '16 at 20:12
  • Characters are defined by a bit pattern, and all the bit patterns exist in both signed and unsigned types. The extended characters become negative numbers when signed, but that doesn't affect the ability to store them. – Mark Ransom Jun 21 '16 at 20:27
3

In other words, can I be sure that my C++ program will get a set of ASCII characters if I do this ...

No. std::string is just a convenience alias for a specialization of std::basic_string<>, roughly

using string = std::basic_string<char>;

where the primary template is declared as

template< 
    class CharT, 
    class Traits = std::char_traits<CharT>, 
    class Allocator = std::allocator<CharT>
> class basic_string;

and can hold elements of any character type CharT, with Traits describing how those elements are compared and copied.

In short, std::string can contain ASCII-encoded text, EBCDIC-encoded text, or anything else; the encoding is transparent to the class itself and depends entirely on how you use it.
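A sketch of that genericity, showing the same template instantiated with several standard character types:

#include <string>

int main() {
    std::string    bytes = "abc";  // basic_string<char>
    std::wstring   wide  = L"abc"; // basic_string<wchar_t>
    std::u16string u16   = u"abc"; // basic_string<char16_t>
    std::u32string u32   = U"abc"; // basic_string<char32_t>
    // Each simply stores a sequence of its CharT elements;
    // how you interpret those elements is up to your program.
}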

πάντα ῥεῖ