9

The following program demonstrates an inconsistency in the way that std::istream (specifically in my test code, std::istringstream) sets eof().

#include <sstream>
#include <cassert>

int main(int argc, const char * argv[])
{
    // EXHIBIT A:
    {
        // An empty stream doesn't recognize that it's empty...
        std::istringstream stream( "" );
        assert( !stream.eof() );        // (Not yet EOF. Maybe should be.)
        // ...until I read from it:
        const int c = stream.get();
        assert( c < 0 );                // (We received garbage.)
        assert( stream.eof() );         // (Now we're EOF.)
    }
    // THE MORAL: EOF only happens when actually attempting to read PAST the end of the stream.

    // EXHIBIT B:
    {
        // A stream that still has data beyond the current read position...
        std::istringstream stream( "c" );
        assert( !stream.eof() );        // (Clearly not yet EOF.)
        // ... clearly isn't eof(). But when I read the last character...
        const int c = stream.get();
        assert( c == 'c' );             // (We received something legit.)
        assert( !stream.eof() );        // (But we're already EOF?! THIS ASSERT FAILS.)
    }
    // THE MORAL: EOF happens when reading the character BEFORE the end of the stream.

    // Conclusion: MADNESS.
    return 0;
}

So, eof() "fires" when you read the character before the actual end-of-file. But if the stream is empty, it only fires when you actually attempt to read a character. Does eof() mean "you just tried to read off the end?" or "If you try to read again, you'll go off the end?" The answer is inconsistent.

Moreover, whether the assert fires or not depends on the compiler. Apple Clang 4.1, for example, fires the assertion (raises eof() when reading the preceding character). GCC 4.7.2, for example, does not.

This inconsistency makes it hard to write sensible loops that read through a stream but handle both empty and non-empty streams well.

OPTION 1:

while( stream && !stream.eof() )
{
    const int c = stream.get();    // BUG: Wrong if stream was empty before the loop.
    // ...
}

OPTION 2:

while( stream )
{
    const int c = stream.get();
    if( stream.eof() )
    {
        // BUG: Wrong when c in fact got the last character of the stream.
        break;
    }
    // ...
}

So, friends, how do I write a loop that parses through a stream, dealing with each character in turn, handles every character, but stops without fuss either when we hit the EOF, or in the case when the stream is empty to begin with, never starts?

And okay, the deeper question: I have the intuition that using peek() could maybe work around this eof() inconsistency somehow, but...holy crap! Why the inconsistency?

OldPeculier
  • Could you please specify what compiler? Your `EXHIBIT B:` behavior seems like a bug to me. – Jesse Good Nov 02 '12 at 23:24
  • @JesseGood: It seems like correct behavior for most implementations: The stream didn't try to read past EOF just yet. Only once it tried to read past the last character is it required to set `eof()`. It is allowed to set `eof()` earlier, though. – Dietmar Kühl Nov 02 '12 at 23:31
  • @DietmarKühl: `It is allowed to set eof() earlier, though.` Hmm, are you really sure about that? – Jesse Good Nov 02 '12 at 23:39
  • Works for me (runs to completion) with multiple compilers. – David Hammen Nov 02 '12 at 23:41
  • @DietmarKühl: where do you see that it can set eof() earlier? Remember, the OP uses get(), which reads exactly one character so that there is no ambiguity. – rici Nov 02 '12 at 23:51
  • @JesseGood: The standard doesn't mandate the use of `sgetc()`, `snextc()`, or `sbumpc()`. Depending on how the character is looked at exactly, it may set `eof()` potentially earlier. – Dietmar Kühl Nov 02 '12 at 23:57
  • @rici: The comment was more general than the call to `std::istream::get()`. That said, depending on how the character is looked at, it may touch `eof()`: if the character is looked at using `sgetc()` followed by a call to `snextc()` it may touch EOF. If the character is looked at using `sbumpc()` it won't. Also, constructing an `std::istringstream` with an empty string may or may not have `std::ios_base::eofbit` set. – Dietmar Kühl Nov 03 '12 at 00:01
  • In case people want the blanket statement: 27.7.2.1 [istream] paragraph 2: "... Both groups of input functions are described as if they obtain (or extract) input characters by calling `rdbuf()->sbumpc()` or `rdbuf()->sgetc()`. They may use other public members of `istream`." And paragraph 3 goes on that `eofbit` is set (unless explicitly specified otherwise) if the function returns `traits::eof()`. – Dietmar Kühl Nov 03 '12 at 00:05
  • @DietmarKühl: I think it is impossible for `traits::eof()` to be returned by `sgetc()` or `sbumpc()` in the OP's example because they **both** return `traits::to_int_type(*gptr())` which is not pointing at EOF. – Jesse Good Nov 03 '12 at 00:12
  • @JesseGood: So? The quoted paragraph gives the implementation permission to use something else which, e.g., calls a combination of `sgetc()` and `snextc()`. The latter attempts access to the character after the end. I thought there was a stronger statement to that effect as well but I don't find it. In any case, `get()` doesn't even state how the character is extracted, i.e., `sgetc()` followed by `snextc()` is a perfectly valid implementation although I would use `sbumpc()`. – Dietmar Kühl Nov 03 '12 at 00:33
  • @DietmarKühl: I think that quote needs to be read in conjunction with the next paragraph, which says "If rdbuf()->sbumpc() or rdbuf()->sgetc() returns traits::eof(), then the input function, except as explicitly noted otherwise, completes its actions and does setstate(eofbit)". I don't believe the implementation is given licence to set the `eofbit` in any other circumstance. I suppose it might call `snextc()` after it calls `sgetc()`, but it must still set `eofbit` based on what `sgetc()` returns, and `sgetc()` cannot return `traits::eof()` if there is a character available to return. – rici Nov 03 '12 at 03:47
  • The compiler I'm using is Clang via Xcode 4.5.1. If the assertion doesn't fail for you, could you confirm that assertions are enabled (via the setting of a DEBUG or _DEBUG symbol) on your compiler implementation? Replacing the assert() with a `if( ... ) cout << ... ;` statement would also bypass this ambiguity. – OldPeculier Nov 03 '12 at 13:26
  • I just tried GCC 4.7.2 and confirmed that the behavior _is different_ on that compiler. – OldPeculier Nov 03 '12 at 13:31
  • @OldPeculier: I believe your question is a [duplicate of this question](http://stackoverflow.com/questions/9004715/istream-eof-discrepancy-between-libc-and-libstdc), could you confirm that? – Jesse Good Nov 03 '12 at 22:10
  • @JesseGood My question has two parts: why is the specified behavior inconsistent (and difficult to write a simple loop for), and why do compilers treat the semantics differently. That question answers the second part but not the first. – OldPeculier Nov 04 '12 at 00:42
  • "This inconsistency makes it hard to write sensible loops " - no it doesn't. Simply don't use `eof()` in your loop. – M.M Jul 27 '14 at 22:33

5 Answers

9

The eof() flag is only useful to determine if you hit end of file after some operation. The primary use is to avoid an error message if reading reasonably failed because there wasn't anything more to read. Trying to control a loop or something using eof() is bound to fail. In all cases you need to check after you tried to read if the read was successful. Before the attempt the stream can't know what you are going to read.
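A minimal sketch of that "read first, then test" idea, applied to the per-character loop the question asks for. `collect_chars` is a hypothetical helper name, not something from the question; the sketch relies only on `get(char&)` returning the stream, which converts to false once an extraction fails.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns every character of the stream, in order.
std::string collect_chars(std::istream& in)
{
    std::string seen;
    char ch;
    // get(char&) returns the stream itself. The body runs only when a
    // character was actually extracted, so eof() is never consulted.
    while (in.get(ch)) {
        seen.push_back(ch);
    }
    return seen;
}
```

With this shape the loop body never runs for an empty stream, and the last character is handled regardless of whether the library sets `eofbit` while reading it or only on the following attempt.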

The semantics of eof() are defined thoroughly as "this flag gets set when reading the stream caused the stream buffer to return a failure". It isn't quite as easy to find this statement, if I recall correctly, but this is what it comes down to. At some point the standard also says that the stream is allowed to read more than it has to in some situations, which may cause eof() to be set when you don't necessarily expect it. One such example is reading a character: the stream may end up detecting that there is nothing following that character and set eof().

If you want to handle an empty stream, it's trivial: look at something from the stream and proceed only if you know it's not empty:

if (stream.peek() != std::char_traits<char>::eof()) {
    do_what_needs_to_be_done_for_a_non_empty_stream();
}
else {
    do_something_else();
}
Dietmar Kühl
  • @Vlad: I don't think so. The stream would need to go out of its way and see if there is another character coming to provide a result which may change when you actually try to read it later because the file may have grown. It always comes down to the same issue as well: before you tried to extract something, the stream can't second guess what you'll try to extract. What may be ugly is providing some flexibility in when exactly `eof()` is set, but semantically it doesn't matter when checking after the read, and it probably catered for different implementations to stay within the specification. – Dietmar Kühl Nov 02 '12 at 23:37
  • The function answers not "what the actual state of the stream is", but rather "what the system knows about the stream as a side effect of other calls". This exposes the implementation detail: the stream _must_ memorize during reads the eof status as well. That's why I still personally think it's not elegant. – Vlad Nov 02 '12 at 23:40
  • Another sign of inelegance is that the OP needs an extra `if` :) – Vlad Nov 02 '12 at 23:42
  • @Vlad: the stream is two entities: the stream buffer and the stream. The stream buffer probably has some idea where the end of the sequence is (it may not, though, because the underlying sequence may grow). The stream asks the underlying stream buffer only about the characters it feels it needs. – Dietmar Kühl Nov 02 '12 at 23:47
  • Yes, I know this and I understand how this is implied by "you don't pay for what you didn't ask for" philosophy -- but I would prefer not need to know and not need to care. – Vlad Nov 02 '12 at 23:53
  • Thanks for the explanation. But note that the original code gives different results for the last assertion depending on the compiler. Apple Clang 4.1 fires the assertion. GCC 4.7.2 does not. – OldPeculier Nov 03 '12 at 13:35
  • @OldPeculier: The definition of `std::istream::get()` is fairly vague and just states that a character is _extracted_ but there is no definition on how this is done and there are multiple alternatives, some of them leading to EOF being touched. That said, I can't reproduce your observation on the version of clang/libc++ I'm using which may be partially due to the fact that I don't have access to the future release of clang 4.1. Also, please note that clang itself doesn't seem to ship with a standard library but can use one of many libraries: it may be worth checking which library is used. – Dietmar Kühl Nov 03 '12 at 15:09
5

Never, ever check for eof alone.

The eof flag (which is the same as the eofbit bit flag in a value returned by rdstate()) is set when end-of-file is reached during an extract operation. If there were no extract operations, eofbit is never set, which is why your first check returns false.

However, eofbit is no indication as to whether the operation was successful. For that, check failbit|badbit in rdstate(). failbit means "there was a logical error", and badbit means "there was an I/O error". Conveniently, there's a fail() function that returns true exactly when rdstate() & (failbit|badbit) is nonzero. Even more conveniently, there's an operator bool() function that returns !fail(). So you can do things like while(stream.read(buffer, count)){ ....

If the operation has failed, you may check eofbit, badbit and failbit separately to figure out why it has failed.
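A small sketch of that order of checks: loop on the stream's boolean state, and inspect the individual bits only once an extraction has failed. `sum_ints` and the reason strings are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: sums whitespace-separated integers and reports,
// through `reason`, why the loop stopped.
int sum_ints(std::istream& in, std::string& reason)
{
    int total = 0;
    int value;
    while (in >> value) {   // operator bool() is !fail()
        total += value;
    }
    // Only now, after a failed extraction, are the individual bits examined.
    if (in.bad())
        reason = "I/O error";
    else if (in.eof())
        reason = "end of input";
    else
        reason = "format error";
    return total;
}
```

Note that eof() here only says the end was seen at some point before or during the failing extraction; input that ends in the middle of a number would also be reported as "end of input" by this sketch.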

n. m. could be an AI
1

What compiler / standard c++ library are you using? I tried it with gcc 4.6.3/4.7.2 and clang 3.1, and all of them worked just fine (i.e. the assertion does not fire).

I think you should report this as a bug in your tool-chain, since my reading of the standard accords with your intuition that eof() should not be set as long as get() is able to return a character.

rici
1

It's not a bug, in the sense that it's the intended behavior. It is not the intent that you test eof() until after input has failed. Its main purpose is for use inside extraction functions, where in early implementations, the fact that std::streambuf::sgetc() returned EOF didn't mean that it would the next time it was called: the intent was that any time sgetc() returned EOF (now std::char_traits<>::eof()), this would be memorized, and the stream would make no further calls to the streambuf.

Practically speaking: we really need two eof(): one for internal use, as above, and another which will reliably state that failure was due to having reached end of file. As it is, given something like:

std::istringstream s( "1.23e+" );
s >> aDouble;

There's no way of detecting that the error is due to a format error, rather than the stream not having any more data. In this case, the internal eof should return true (because we have seen end of file, when looking ahead, and we want to suppress all further calls to the streambuf extractor functions), but the external one should be false, because there was data present (even after skipping initial whitespace).
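To make the ambiguity concrete, here is a sketch that reports the two flags after a single `double` extraction. `state_after_double` is a hypothetical helper, and the expectation that both inputs below look identical rests on my reading of the standard's number-parsing rules, so treat it as an assumption to verify on your own library.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: attempts one double extraction from `text` and
// returns the resulting state as "fail=<0|1> eof=<0|1>".
std::string state_after_double(const std::string& text)
{
    std::istringstream s(text);
    double value = 0.0;
    s >> value;
    return std::string("fail=") + (s.fail() ? "1" : "0")
         + " eof=" + (s.eof() ? "1" : "0");
}
```

Under that assumption, `state_after_double("1.23e+")` (malformed number) and `state_after_double("   ")` (nothing but whitespace) should produce the same string, which is exactly why the flags alone cannot tell a format error from missing data.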

If you're not implementing an extractor function, of course, you should never test basic_ios::eof() until you've actually had an input failure. It was never the intent that this would provide any useful information (which makes one wonder why they defined basic_ios::good(): the fact that it returns false if eof() is true means that it can provide no reliable information until fail() returns true, at which point we know that it will return false, so there's no point in calling it).

And I'm not sure what your problem is. Because the stream cannot know in advance what your next input will be (e.g. whether it will skip whitespace or not), it cannot know in advance whether your next input will fail because of end of file or not. The idiom adopted is clear: try the input, then test whether it succeeded or not. There is no other way, because no other alternative can be implemented. Pascal did it differently, but a file in Pascal was typed: you could only read one type from it, so it could always read ahead one element under the hood, and return end of file if this read ahead failed. Not having a look-ahead end of file is the price we pay for being able to read more than one type from a file.

BenMorel
James Kanze
0

The behavior is somewhat subtle. eofbit is set when an attempt is made to read past the end of the file, but that may or may not cause failure of the current extraction operation.

For example:

ifstream blah;
// assume the file got opened
int i, j;
blah >> i;
if (!blah.eof())
    blah >> j;

If the file contains 142<EOF>, then the sequence of digits is terminated by end of file, so eofbit is set AND the extraction succeeds. Extraction of j won't be attempted, because the end of file has already been encountered.

If the file contains 142 <EOF>, then the sequence of digits is terminated by whitespace (extraction of i succeeds). eofbit is not set yet, so blah >> j will be executed, and it will reach end of file without finding any digits, so it will fail.

Notice how the innocuous-looking whitespace at the end of file changed the behavior.
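The two cases can be reproduced without a file. The sketch below mirrors the snippet above with a `std::istringstream` standing in for the `ifstream`; `guarded_read_ok` and `read_two_ints` are hypothetical names, and the second function shows one way to avoid depending on eof() at all.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper mirroring the snippet above. Returns true if the
// stream has not failed after the eof()-guarded second extraction.
bool guarded_read_ok(const std::string& contents)
{
    std::istringstream blah(contents);
    int i = 0, j = 0;
    blah >> i;
    if (!blah.eof())
        blah >> j;
    return !blah.fail();
}

// Alternative sketch: test the extractions themselves instead of eof().
// Returns true only if both integers were read.
bool read_two_ints(const std::string& contents, int& i, int& j)
{
    std::istringstream in(contents);
    return static_cast<bool>(in >> i >> j);
}
```

`guarded_read_ok("142")` and `guarded_read_ok("142 ")` give different answers even though neither input contains a second number, while `read_two_ints` returns false for both.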

Ben Voigt