C++20 changes reading into char arrays with `operator>>` - How to fix this?

Question

EDIT: To summarize from the comments (before I close the topic):

The issue has been discussed here previously:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0487r1.html

The upshot is that the committee was aware this change would break existing code.

I don't know what urgent issue led the LWG to replace the old version (which was in the standard from C++98 to C++17, i.e. for about 20 years) instead of doing it this way (as IMHO was done with std::gets in C++14, for a good reason):

Step 1: Only ADD the new templated version expecting a "reference to an array", from which the number of elements is deduced at compile time (which surely is a protection against buffer overruns). It would have taken effect "from day one".

Step 2: Deprecate the version that was in the standard from C++98 to C++17 and wait for community feedback on whether there are use cases valid enough to keep it. Then, maybe, remove it one standard later.

I think a valid use case is what I show here: https://godbolt.org/z/nG174vnqP

(extracted from real code but shortened to show the issue only). Contained is also a little demo why I think the old and the new version could well have co-existed. But maybe I'm wrong with that assessment.

What I currently find most annoying is that there is no way to resolve the issue in C++20 without stepping into UB-land, especially as I think the "old" version is still available internally and the new version just forwards to it, which is highly probable since you don't want a separate implementation for each different array length.

With the release of C++20, the operator>> overload for reading a char array now expects a char(&)[N] argument instead of a char*. The original code that compiled correctly since C++98, which looks like the following, will not work any more:

    std::size_t sz = 10;
    char *cp = new char[sz];
    ...
    std::cin >> std::setw(sz) >> cp;

To correct this, the code can be modified as follows:

    std::cin >> std::setw(sz) >> *reinterpret_cast<char(*)[std::numeric_limits<int>::max()]>(cp);

See here: https://godbolt.org/z/svPcT4eao

Additionally, there's an issue with a common implementation of variable-length strings, whose behavior can silently change with no diagnostic at compile time.

To answer the generally asked question in the comments why I don't use std::string:

In fact, I use std::string a lot but I also occasionally coach people who work in projects where you don't want to add any unnecessary overhead and some prefer string classes that don't use three pointers when a single one is sufficient. The example is extracted from one of those.

I was also pointed to the answer to "Can't use std::cin with char* or char[] in C++20". Yes, it is about the same topic, but the more important information in that answer is in the LWG paper it points to: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0487r1.html

It makes clear that valid C++17 code no longer compiles, but sadly it doesn't cover the silent change, where code that was valid before is no longer valid but does not cause a compile-time error:

At least, that is, provided the

    struct vbuf {
        std::size_t sz;
        char cbuf[1];
    };

technique with over-allocation hasn't been turned into UB by an earlier C++ standard.

In the comment below, @n.m. remarked that this was already UB in C. He may be correct (I have not checked every C standard since C89, and I'm relatively sure it was not UB then), but at least it is a common technique (e.g. in the buffers for Linux messages, see msgsnd(3P) etc.) and therefore I think it is a safe assumption, at least for the Linux family of compilers, that this is well defined and safe.

But I will no longer claim that the new version causes a "silent change", because that of course does not apply if we are in UB-land.

Why do you need to `new[]` an array of size 10 in the first place? Just declare it as `char cp[10];` instead — UnholySheep, Jun 16 '23 at 13:47
Or if `sz` is actually variable then you should be using a class like `std::string` instead — UnholySheep, Jun 16 '23 at 13:49
Does this answer your question? [Can't use std::cin with char\* or char\[\] in C++20](https://stackoverflow.com/questions/62194484/cant-use-stdcin-with-char-or-char-in-c20) — 康桓瑋, Jun 16 '23 at 14:17
Any reason you're not using `std::string` instead of the array? — HolyBlackCat, Jun 16 '23 at 14:23
first of all, the example code is an extract from a string class and concentrates on the problem at hand; there are valid reasons not to use std::string, one of which is you don't want the overhead it creates (typically 3 pointers per string instead of just one you need for Java-Like strings that cannot change during their lifetime); also, on small embedded systems you either don't use any heap memory at all or you limit the use of heap memory to an initialization phase; which means you don't even need a full heap memory management; — Martin Weitzel, Jun 16 '23 at 14:28
@MartinWeitzel: But in this case the data is coming from external input, so every example you cited doesn't apply. The old code was super dangerous and wildly prone to attacks and undefined behavior, and has been removed. You should not work around it to reenable it, because that would reopen your program to attacks and undefined behavior. — Mooing Duck, Jun 16 '23 at 15:01
@Mooning Duck which problem do you see despite `std::setw(sz)` is included in my solution? I code in C since the mid 1980s and in C++ since 1990, usually with meticulous unit testing and therefore I'm curious which tests I have missed that would have pointed out a potential problem. (IoW: do you assume I do NOT test my code for buffer overruns?) — Martin Weitzel, Jun 16 '23 at 15:30
@MooingDuck sorry for misspelling your name in the previous comment. — Martin Weitzel, Jun 16 '23 at 15:38
I was totally unaware that `width` did anything for istreams, but you're right! https://en.cppreference.com/w/cpp/io/basic_istream/operator_gtgt2 — Mooing Duck, Jun 16 '23 at 15:41
@MooingDuck thanks again. What I learned meanwhile from https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0487r1.html is that the fact valid code will not any longer compile after this change was predicted but obviously they not thought of the SILENT CHANGE pointed out in my linked example code. So be it, I must live with the extremely ugly reinterpret_cast now. But some C++ projects using a C-compatible light-weight string class based on over-allocation for struct with a size as first member and a single element array as second may be bitten by it. I need to warn my customers. — Martin Weitzel, Jun 16 '23 at 16:18
It is not hard to write an extractor that doesn't invoke UB like your cast to a huge array does. Construct a sentry, read characters one by one, stop when a whitespace is seen. But perhaps more importantly, no one should ever use `>>` for string-like things, the stop-on-whitespace behaviour does more harm than good. — n. m. could be an AI, Jun 16 '23 at 16:20
"based on over-allocation for struct with a size as first member" this is and always have been UB in C++. — n. m. could be an AI, Jun 16 '23 at 16:20
I'd make a `into_buffer` helper wrapper class so you can do `cin >> into_buffer(cp, sz)`, which would encapsulate all the messy and verbose behavior into one spot. — Eljay, Jun 16 '23 at 16:26
@n.m. I'm curious what UB you're thinking of. Will you let me know? Is that "huge number" used for anything else but to deduce the N in the T(&)[N] template? Which in turn is probably just fed into the `std::min` ending up setting the width for input. And why shouldn't I use `operator>>` to extract white-space separated items from an `std::istringstream` which I constructed from an `std::string` which I read before from a file using `std::getline` (the free standing function, not the `std::istream` member function). Do you propose low-level string searches for this or regular expressions? — Martin Weitzel, Jun 16 '23 at 16:31
The cast itself is not UB, but accessing an object via an lvalue of a wrong type is. Since we don't know how exactly `operator>>` uses the reference passed to it, we cannot guarantee anything about it. — n. m. could be an AI, Jun 16 '23 at 16:40
@Eljay thanks for responding. Of course I would encapsulate that ugly reinterpret_cast in a wrapper. If alone for the reason I would never use a reinterpret_cast without an exhaustive comment explaining why it's unavoidable. But that doesn't help in the case of the silent change I pointed out, when C++20 compiles valid C++17 code - in fact code valid since C++98 - and gives it a different behavior. (See the second example in the code I linked.) — Martin Weitzel, Jun 16 '23 at 16:44
One should avoid using `operator>>` to extract white-space separated items because (1) whitespace is a vague concept that depends on locale, and (2) data is whitespace-separated right until the day one discovers that whitespace is valid inside the data. To read data separated by exactly the space character `' '`, one can use `basic_istream::getline` with the delimiter argument. At least it is trivial to change the delimiter when need arises (as opposed to fiddling with locale facets). — n. m. could be an AI, Jun 16 '23 at 16:51
Every standard of C++ that has come out has had some breaking changes from previous standards. WG21 tries hard for backwards compatibility, but have never had 100% success with that. (Not counting the defect reports of the standard, some of which were retroactively applied to early standards.) — Eljay, Jun 16 '23 at 17:26
@n.m. so, if "over-allocation has always been UB in C++", then each time I compile with a C++ compiler some code written in the common subset of C and C++ (common wrt. syntax and the library functions available) using - say - Linux messages (`msgrcv`, `msgsnd` etc.) I'm already with one foot in UB-land? — Martin Weitzel, Jun 16 '23 at 17:31
@Eljay I understand you and I can well accept `std::gets` is gone forever (since C++14) because it's easy to replace and the result is safer. But that `istream::getline` stayed despite it can NOT be made "safe at compile time" shows there are use cases for giving a size and (char) pointer as argument and such exist for the functionality previously provided with `operator>>` for a size (set with `std::setw`) and a char pointer too. — Martin Weitzel, Jun 16 '23 at 17:49
It is UB in C as well. What isn't UB in C is flexible array member, which was added to C *exactly because* the 1-sized array technique is UB, but C++ doesn't have any such thing. And yes, there is lots and lots of widely used code with UB, does it come to you as a surprise? — n. m. could be an AI, Jun 16 '23 at 18:02
Changes to the C++ standard follow a process. You can submit a proposal to remove `istream::getline` on the basis that it can not be made safe at compile time. — Eljay, Jun 16 '23 at 19:31
@n.m. I'd rather put it as Clive Feather once did, long ago (not knowing your age I can't say if you read the "C Users Journal" in the early 1990s): "UB is the chance to convince your favorite compiler vendor the behavior YOU want to have is what is best for the majority of the developers who use that compiler" IoW: UB is put in the standard where the ISO/ANSI committee members didn't want to make a decision in a non-trivial trade-off (eg. between "robustness" and "performance") — Martin Weitzel, Jun 16 '23 at 20:32
If you want to have UB in your code, then who am I to judge? — n. m. could be an AI, Jun 16 '23 at 20:46
@Eljay now you disappoint me :-) If you really followed my thought process I would have expected you to say I should submit a proposal to have BOTH, the "reference to array of N" version AND the one - to which the first probably delegates anyways after setting the width - which used a plain pointer and relies on the user setting the width. I currently see no reason why we can have only one of the two, but maybe I just have insufficient experience with C++ template programming. — Martin Weitzel, Jun 16 '23 at 20:46
@n.m. and I think you didn't get the gist of what Clive Feathers said some 30 years ago. But who am I to judge. I'm NOT arguing in favor of UB and I'm not arguing against the "reference to array of N" template based version. I'm arguing against REMOVING the pointer based version. — Martin Weitzel, Jun 16 '23 at 20:49
@n.m. "What isn't UB in C is flexible array member, which was added to C exactly because the 1-sized array technique is UB" YES, you are right! But only if we access that array through an indexing operation, which a compiler (or standard conforming interpreter) might take as chance to do index checking and once it detects an OOB index we are in UB-land. But this is not necessarily the case. What happens is that (1) it relies on the decay of an array name to a pointer and (2) that heap memory delivers a contiguous block memory. Therefore the "over-allocation" as used in message buffers is OK. — Martin Weitzel, Jun 16 '23 at 23:04

score -1 · Answer 1 · answered Jun 18 '23 at 19:09

The issue regarding the changes in C++20 for reading into char arrays with the operator>> overload has been discussed in the following document: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0487r1.html. The resolution reached by the committee was that they were aware that these changes could break existing code.

It's unclear why the Library Working Group (LWG) decided to replace the old version of the operator>> that was present in the standard for approximately 20 years (between C++98 and C++17) without following a different approach, as was done with std::gets in C++14. One possible approach could have been:

Step 1: Only add the new templated version that expects a "reference to an array" as a parameter, from which the number of elements is deduced at compile time. This approach provides protection against buffer overruns and could have been effective from the beginning.

Step 2: Deprecate the version of the operator>> that was present in the standard from C++98 to C++17 and gather community feedback to determine if there are valid use cases to keep it. Then, potentially remove it in a future standard.

One example use case demonstrating the issue can be found here: https://godbolt.org/z/nG174vnqP. The example is extracted from real code but has been shortened to highlight the issue. Additionally, it includes a small demonstration of why the old and new versions could have coexisted.

The current situation is frustrating because there is no straightforward way to resolve the issue in C++20 without stepping into undefined behavior territory. It is worth noting that the "old" version of the operator>> might still be available internally, with the new version simply forwarding to it. This is a guess, resting on the expectation that implementers would not want a separate implementation for each array length.

With the release of C++20, the operator>> overload for reading into a char array now expects a char(&)[N] argument instead of a char*. To address this, the code can be modified as follows:

    std::cin >> std::setw(sz) >> *reinterpret_cast<char(*)[std::numeric_limits<int>::max()]>(cp);

An example demonstrating this modification can be found here: https://godbolt.org/z/svPcT4eao.

There is also an issue related to common implementations of variable-length strings, which can silently change without indication at compile time.

Regarding the question of why std::string is not used, it's important to note that std::string is widely used, but there are cases where developers prefer string classes that do not introduce unnecessary overhead. The provided example is extracted from one such case.

In a comment, it was mentioned that a similar topic is covered in the answer to the question "Can't use std::cin with char* or char[] in C++20" (https://stackoverflow.com/questions/62194484/cant-use-stdcin-with-char-or-char-in-c20). The important information in that answer is contained in an LWG document linked within it: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0487r1.html.

It's worth noting that while this technique of over-allocation using the struct vbuf might have been considered safe and well-defined in earlier C++ standards, it may now be considered undefined behavior.

This looks like it contains mostly AI-generated text. — tchrist, Jun 30 '23 at 13:10
