Characters extracted by istream >> double

Question

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    double d; std::string s;

    std::istringstream iss("234cdefipxngh");
    iss >> d;
    iss.clear();
    iss >> s;
    std::cout << d << ", '" << s << "'\n";
}

I'm reading off N3337 here (presumably that is the same as C++11). In [istream.formatted.arithmetic] we have (paraphrased):

operator>>(double& val);

As in the case of the inserters, these extractors depend on the locale’s num_get<> (22.4.2.1) object to perform parsing the input stream data. These extractors behave as formatted input functions (as described in 27.7.2.2.1). After a sentry object is constructed, the conversion occurs as if performed by the following code fragment:

typedef num_get< charT,istreambuf_iterator<charT,traits> > numget;
iostate err = iostate::goodbit;
use_facet< numget >(loc).get(*this, 0, *this, err, val);
setstate(err);

Looking over to 22.4.2.1:

The details of this operation occur in three stages
— Stage 1: Determine a conversion specifier
— Stage 2: Extract characters from in and determine a corresponding char value for the format expected by the conversion specification determined in stage 1.
— Stage 3: Store results

The description of Stage 2 is too long to paste in full here. However, it clearly says that all characters should be extracted before conversion is attempted, and further that exactly the following characters should be extracted:

any of 0123456789abcdefxABCDEFX+-
The locale's decimal_point()
The locale's thousands_sep()

Finally, the rules for Stage 3 include:

— For a floating-point value, the function strtold.

The numeric value to be stored can be one of:

— zero, if the conversion function fails to convert the entire field.

This all seems to clearly specify that the output of my code should be 0, 'ipxngh'. However, it actually outputs something else.

Is this a compiler/library bug? Is there any provision that I'm overlooking for a locale to change the behaviour of Stage 2? (In another question someone posted an example of a system that does actually extract the characters, but also extracts ipxn which are not in the list specified in N3337).

Update

As pointed out by perreal, this text from Stage 2 is relevant:

If discard is true, then if ’.’ has not yet been accumulated, then the position of the character is remembered, but the character is otherwise ignored. Otherwise, if ’.’ has already been accumulated, the character is discarded and Stage 2 terminates. If it is not discarded, then a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.

If the character is either discarded or accumulated then in is advanced by ++in and processing returns to the beginning of stage 2.

So, Stage 2 can terminate if the character is in the list of allowed characters, but is not a valid character for %g. The standard doesn't say exactly what is valid, but presumably this refers to the definition of fscanf from C99, which allows:

a nonempty sequence of decimal digits optionally containing a decimal-point character, then an optional exponent part as defined in 6.4.4.2;

a 0x or 0X, then a nonempty sequence of hexadecimal digits optionally containing a decimal-point character, then an optional binary exponent part as defined in 6.4.4.2;

INF or INFINITY, ignoring case

NAN or NAN(n-char-sequence opt ), ignoring case in the NAN part, where:

and also

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.

So, actually the Coliru output is correct; and in fact the processing must attempt to validate the sequence of characters extracted so far as a valid input to %g, while extracting each character.

Next question: is it permitted, as in the thread I linked to earlier, to accept i, n, p, etc. in Stage 2?

These are valid characters for %g; however, they are not in the list of atoms which Stage 2 is allowed to read (i.e. c == 0 for my latest quote, so the character is neither discarded nor accumulated).

[Clang with libc++](http://coliru.stacked-crooked.com/a/3f2adc5accefb54c) produces `0, 'gh'`. This is clang bug [17782](http://llvm.org/bugs/show_bug.cgi?id=17782). — T.C., Jul 11 '14 at 03:32
@TC Updated my question text to focus more specifically on that issue (since perreal's point dealt with my sample code). `i` `n` `p` are valid characters for `%g` , e.g. it could read `"inf"` or `"nan"`, and `p` is the binary point for hexadecimal representations. — M.M, Jul 11 '14 at 03:38
[GCC at ideone.com](http://ideone.com/UIf1kP) produces `234, 'cdefipxngh'`. — Tony Delroy, Jul 11 '14 at 03:40
@MattMcNabb But if `%g` allows hexfloat, then `a` is a valid character. — T.C., Jul 11 '14 at 03:44
@MattMcNabb Well, that depends on whether validity is context-sensitive...GCC doesn't parse `0xABp-4` either — T.C., Jul 11 '14 at 03:48
@TC `p` is not in the list of acceptable atoms, however `0` `x` `A` `B` are, so it seems to me that `0xAB` should be processed (according to the letter of N3337) and `p` left in the stream. — M.M, Jul 11 '14 at 03:49
@MattMcNabb It [just parses `0`](http://coliru.stacked-crooked.com/a/b37e3156ac0ea266). — T.C., Jul 11 '14 at 03:51
@TC OK, so (according to your answer) just parsing `0` instead of `0xAB` is a bug that was addressed by LWG 221; however in accordance with LWG 2381 the situation is still crappy and it would be better if it did read all of `0xABp-4`. — M.M, Jul 11 '14 at 03:57
@MattMcNabb I think LWG 221 was more intended to make `%i` work, and gcc simply didn't implement hexfloat support for whatever reason. — T.C., Jul 11 '14 at 04:06

T.C. · Accepted Answer · 2014-07-11T04:20:28.703

5

This is a mess because it's likely that neither gcc/libstdc++'s nor clang/libc++'s implementation is conforming. It's unclear what "a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1" means, but I think that the use of the phrase "next character" indicates that the check should be context-sensitive (i.e., dependent on the characters that have already been accumulated), and so an attempt to parse, e.g., "21abc" should stop when 'a' is encountered. This is consistent with the discussion in LWG issue 2041, which added this sentence back to the standard after it had been deleted during the drafting of C++11. libc++'s failure to do so is bug 17782.

libstdc++, on the other hand, refuses to parse "0xABp-4" past the 0, which is actually clearly nonconforming based on the standard (it should parse "0xAB" as a hexfloat, as clearly allowed by the C99 fscanf specification for %g).

Accepting i, p, and n is not allowed by the standard. See LWG issue 2381.

The standard describes the processing very precisely - it must be done "as if" by the specified code fragment, which does not accept those characters. Compare the resolution of LWG issue 221, in which they added x and X to the list of characters because num_get as then-described won't otherwise parse 0x for integer inputs.

Clang/libc++ accepts "inf" and "nan" along with hexfloats but not "infinity" as an extension. See bug 19611.

edited Jul 11 '14 at 04:20

answered Jul 11 '14 at 03:50

T.C.


Good find about LWG 2381 ; that resolution seems like a good idea to me -- it should mirror the behaviour of `strtold`; it doesn't make sense to me that Stage 2 should specify the list of atoms and yet still require checking against `%g` at each step. – M.M Jul 11 '14 at 03:55
@MattMcNabb The problem is that `strtold` has the whole string to work with, while `num_get` must be single pass, and can't (or shouldn't) read beyond the first invalid character. Makes the thing really difficult to specify right. – T.C. Jul 11 '14 at 04:10
Yes. The clause pointed out by perreal says that if the whole thing so far is acceptable as an input, then accept the character. I don't see how we can do better than this (given that multiple put-backs into the stream may not be supported). Unfortunately it would mean that "Nappies" has to have the "Na" extracted and discarded (it could have been the start of `"NaN"`). – M.M Jul 11 '14 at 04:13
In hindsight maybe it would have been better to read up to the next whitespace (or at least have the atoms list include all `isalnum` characters), and not have the perreal clause. That would seem more limiting at first, but actually more practical taking into consideration the corner cases we've looked at here. – M.M Jul 11 '14 at 04:26
Yeah I'm saying that doing that from day one of C++ might have worked out better. Not possible to make such a change now. – M.M Jul 11 '14 at 04:30
@MattMcNabb I kinda suspect that even very old `scanf` implementations would parse `"8hello"` with a `%d` to produce `8`, so that'd still produce a C compatibility issue ("Why'd I use this newfangled C++/iostreams thing? It can't even parse an integer right!"). – T.C. Jul 11 '14 at 04:32

score 4 · Answer 2 · edited Jun 20 '20 at 09:12

At the end of stage 2, it says:

If it is not discarded, then a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.

If the character is either discarded or accumulated then in is advanced by ++in and processing returns to the beginning of stage 2.

So perhaps 'a' is not allowed as the next character for the %g specifier, in which case it is neither accumulated nor discarded.


answered Jul 11 '14 at 03:18

perreal


You're right. I have updated my question taking your point into account – M.M Jul 11 '14 at 03:35
