-1

I need to read a file line-by-line twice. The file content is expected to fit into memory. So, I would normally read the whole file into a buffer and work with that buffer afterwards.

However, since I would like to use std::getline, I need to work with a std::basic_istream. So, I thought it would be a good idea to write

std::ifstream file(filepath);
std::stringstream ss;
ss << file.rdbuf();

for (std::string line; std::getline(ss, line);)
{
}

However, I'm not sure what exactly is happening here. I guess ss << file.rdbuf(); does not read the file into any internal buffer of ss. Actual file access should occur only at std::getline(ss, line);.

So, with a second for-loop of the provided form, I would end up reading the whole file once again. That's inefficient.

Am I correct, and do I hence need to come up with another approach?

0xbadf00d
  • You need to seek back to the beginning of `ss` before the second loop. – Barmar Apr 05 '18 at 22:09
  • Why not simply `for (std::string line; std::getline(file, line);) {}` why do you need stringstream in between? – Killzone Kid Apr 05 '18 at 22:10
  • Why don't you just read it into a vector of strings the first time, then you can use the vector for the second loop. – Barmar Apr 05 '18 at 22:10
  • @KillzoneKid He doesn't want to read from the file twice, he wants to cache it in a string stream in memory. – Barmar Apr 05 '18 at 22:10
  • _I guess `ss << file.rdbuf();` does not read the file into any internal buffer of `ss`._ Yes it does. – Miles Budnek Apr 05 '18 at 22:11
  • The answer you checked is a good answer, no, a great answer - but not for your XY problem. What you really want to know is why your program does not work, not whether you guessed wrong about how stringstreams behave. – Jive Dadson Apr 05 '18 at 23:56
  • "*However, since I would like to use `std::getline`, I need to work with a `std::basic_istream`*" - `std::ifstream` derives from `std::basic_istream`, so you can pass a `std::ifstream` object directly to `std::getline()`. If you want to read the file a second time, just seek the `ifstream` back to the beginning of the file – Remy Lebeau Apr 06 '18 at 01:52

3 Answers

2

I guess ss << file.rdbuf(); does not read the file into any internal buffer of ss. Actual file access should occur only at std::getline(ss, line);.

This is incorrect. cppreference.com has this to say about that operator<< overload:

basic_ostream& operator<<( std::basic_streambuf<CharT, Traits>* sb); (9)

9) Behaves as an UnformattedOutputFunction. After constructing and checking the sentry object, checks if sb is a null pointer. If it is, executes setstate(badbit) and exits. Otherwise, extracts characters from the input sequence controlled by sb and inserts them into *this until one of the following conditions are met:

  • end-of-file occurs on the input sequence;
  • inserting in the output sequence fails (in which case the character to be inserted is not extracted);
  • an exception occurs (in which case the exception is caught).

If no characters were inserted, executes setstate(failbit). If an exception was thrown while extracting, sets failbit and, if failbit is set in exceptions(), rethrows the exception.

So your assumption is incorrect. The entire contents of file is copied to the buffer controlled by ss, so reading from ss does not access the filesystem. You can freely read through ss and seek back to the beginning as many times as you like without incurring the overhead of re-reading the file each time.
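To make the copying behaviour concrete, here is a minimal sketch. The helper name slurp is hypothetical and not part of the question; it accepts any std::istream, so a std::ifstream works the same way as the in-memory source used for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: copies everything that remains in `source` into a
// stringstream's own buffer and returns that buffered text.
std::string slurp(std::istream& source)
{
    std::stringstream ss;
    ss << source.rdbuf();  // extracts from source's buffer until end-of-file
    return ss.str();       // the data now lives in ss, independent of source
}
```

After the call, the text returned came from ss alone; the source stream has been read to its end and is not consulted again.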

Miles Budnek
  • Is there a better way to fetch the lines from `ss`? `std::getline(ss, line);` will cause memory allocation and copying of the data. This seems to be inefficient. Maybe there is a solution using a `std::string_view`? – 0xbadf00d Apr 05 '18 at 23:51
  • There is no way to avoid memory allocation and copying the data. It starts on a disk or something and winds up in memory. Copying contiguous data in memory is virtually free compared to reading it from disk. The only way to optimize involves measuring a real application's performance. But as I said, I doubt that your actual problem has anything to do with using a streambuf. I think the program is treating a Unicode file like US-ASCII. Try the experiment I suggested. – Jive Dadson Apr 06 '18 at 01:42
  • @JiveDadson I guess you've got me wrong. What I meant is that (after `ss << file.rdbuf();`) `ss` already contains the content of the file in an internal buffer (maybe even a `std::string`). So, it would make sense to have a `getline`-like function, which returns a `std::string_view` of the substring representing the next line. – 0xbadf00d Apr 06 '18 at 19:52
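
A getline-like function over std::string_view, as asked about in the last comment, could look roughly like the sketch below. It assumes C++17, and the name next_line is hypothetical. The views point into storage owned elsewhere (for example a std::string obtained once from ss.str()), so that storage must outlive them.

```cpp
#include <string_view>

// Hypothetical helper: pops the next line off the front of `rest` and
// stores it in `line` without copying any characters.
// Returns false once `rest` is exhausted.
bool next_line(std::string_view& rest, std::string_view& line)
{
    if (rest.empty())
        return false;
    const auto pos = rest.find('\n');
    line = rest.substr(0, pos);  // pos == npos yields the whole remainder
    rest.remove_prefix(pos == std::string_view::npos ? rest.size() : pos + 1);
    return true;
}
```

A second pass only needs a fresh std::string_view over the same text; nothing has to be rewound or cleared.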
0

After the first loop, clear the EOF and fail bits and go back to the beginning of the stringstream with:

ss.clear();
ss.seekg(0, std::ios::beg);
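
Wrapped into a function, a repeatable pass might look like this sketch. The name count_lines is hypothetical; the order of the two calls matters, because a getline loop leaves failbit set and seekg does nothing on a stream in a failed state.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines in `ss` from the start,
// whatever state a previous pass left the stream in.
std::size_t count_lines(std::stringstream& ss)
{
    ss.clear();                  // reset eofbit/failbit from an earlier pass
    ss.seekg(0, std::ios::beg);  // move the read position back to the start
    std::size_t n = 0;
    for (std::string line; std::getline(ss, line);)
        ++n;
    return n;
}
```

Calling it twice on the same stringstream gives the same count both times, with no file access involved.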
Barmar
0

Am I correct, and do I hence need to come up with another approach?

You're not correct. The "hence" is also unwarranted. There's not enough info in the question, but I suspect the problem has nothing to do with using a stream buffer.

Without knowing what that first "garbage" character is, I cannot say for sure, but I suspect the file is in a wide-character Unicode format, and you are using access operations that do not work on wide characters. If that is the case, buffering the file has nothing to do with the problem.

As an experiment, try the following. Mind the w's.

    std::wifstream file(filepath);
    std::wstringstream ss;
    ss << file.rdbuf();

    for (int i = 0; i < 42; ++i) {
        wchar_t ch;
        ss >> ch;
        std::cout << static_cast<unsigned>(ch) << ' ';
    }

It would not surprise me if the first four numbers are 255 254 92 0, or 255 254 47 0.

This might help: Problem using getline with unicode files

Jive Dadson