
Target: There is a text file (on disk) containing integers separated by some kind of delimiter.

Example:

5245
234224
6534
1234

I need to read them into an STL container.

#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

int main(int argc, char * argv[]) {
  using namespace std;

  // 1. prepare the file stream
  string fileName;
  if (argc > 1)
    fileName = argv[1];
  else {
    cout << "Provide the filename to read from: ";
    cin >> fileName;
  }
  unique_ptr<ifstream> ptrToStream(new ifstream(fileName));
  if (!ptrToStream->good()) {
    cerr << "Error opening file " << fileName << endl;
    return -1;
  }

  // 2. value by value reading will be too slow on large data so buffer data
  typedef unsigned int values_type;
  const int BUFFER_SIZE(4); // 4 is for testing purposes. 16MB or larger in real life
  vector<values_type> numbersBuffer;
  numbersBuffer.reserve(BUFFER_SIZE);
  numbersBuffer.insert(numbersBuffer.begin(), istream_iterator<values_type>(*ptrToStream), istream_iterator<values_type>());
  // ...

The main drawback of this code is that it cannot handle the case where the file is extremely large, so I cannot store all of its contents in memory. I also do not want to use push_back, as it is inefficient compared to a range insert.


So, the question is: how can I read no more than BUFFER_SIZE elements from the file efficiently using the STL?

nickolay

2 Answers


The approach to limiting reading from input iterators is to create a wrapper which counts the number of elements processed so far and whose end iterator compares against this count. Doing this generically isn't quite trivial; doing it specifically for std::istream_iterator<T> shouldn't be too hard. That said, I think the easiest way to do it is this:

std::vector<T> buffer;
buffer.reserve(size);
std::istream_iterator<T> it(in), end;
for (std::vector<T>::size_type count(0), capacity(size);
     it != end && count != capacity; ++it, ++count) {
    buffer.push_back(*it);
}

I realize that you don't want to push_back() because it is allegedly slow. However, compared to the I/O operation I doubt that you'll be able to measure the small overhead, especially with typical implementations of the I/O library.

Just to round things off with an example of a wrapped iterator: below is an example of how a counting wrapper for std::istream_iterator<T> could look. There are many different ways this could be done; this is just one of them.

#include <iostream>
#include <iterator>
#include <vector>
#include <sstream>

template <typename T>
class counted_istream_iterator:
    public std::iterator<std::input_iterator_tag, T, std::ptrdiff_t>
{
public:
    explicit counted_istream_iterator(std::istream& in): count_(), it_(in) {}
    explicit counted_istream_iterator(size_t count): count_(count), it_() {}

    T const& operator*() { return *this->it_; }
    T const* operator->() { return this->it_.operator->(); }
    counted_istream_iterator& operator++() {
        ++this->count_; ++this->it_; return *this;
    }
    counted_istream_iterator operator++(int) {
        counted_istream_iterator rc(*this); ++*this; return rc;
    }

    bool operator== (counted_istream_iterator const& other) const {
        return this->count_ == other.count_ || this->it_ == other.it_;
    }
    bool operator!= (counted_istream_iterator const& other) const {
        return !(*this == other);
    }
private:
    std::ptrdiff_t           count_;
    std::istream_iterator<T> it_;
};

void read(int count)
{
    std::istringstream in("0 1 2 3 4 5 6 7 8 9");
    std::vector<int>   vec;
    vec.insert(vec.end(), counted_istream_iterator<int>(in),
               counted_istream_iterator<int>(count));
    std::cout << "size=" << vec.size() << "\n";
}

int main()
{
    read(4);
    read(100);
}
Dietmar Kühl
  • Thanks! But I would like to make the `end` iterator point to the element at `begin + size`, and I don't know how. Anyway, is there a way to use the range `insert` version? – nickolay Feb 17 '12 at 23:15
  • As I mentioned in the answer already: yes. You would create a wrapper holding an `std::istream_iterator` and a count and compare both the iterator and the count to determine whether you have reached the end. – Dietmar Kühl Feb 18 '12 at 01:38
  • Hi! Please explain the wrapper a bit more. Do you mean I should use `std::istream_iterator::operator++()` to retrieve only the needed portion of data from the `istream`? What is the best way to transfer this data to the `vector`? – nickolay Feb 18 '12 at 18:09
  • No. What I mean is that you could write an iterator using an `std::istream_iterator` internally combined with a count which stops when the desired number of elements is read. I have added an example of this to my answer. – Dietmar Kühl Feb 18 '12 at 22:51
  • If the file is especially large, you may want to cache the offset in the file where every 100th number is, making arbitrary seek time significantly faster. – Mooing Duck Feb 18 '12 at 23:04
  • @MooingDuck: this is an input iterator - it just traverses the stream once but possibly stops earlier than reaching the end of the file. There are no seeks being done at all (nor should there be any seeks: each seek effectively kills the stream's buffer and causes things to be unnecessarily slow). – Dietmar Kühl Feb 18 '12 at 23:09
  • @DietmarKühl: Apparently I misunderstood "I cannot store all of it's contents in memory" – Mooing Duck Feb 19 '12 at 00:46
  • @DietmarKühl Dietmar, can you clarify one more issue about `istream_iterator` for me? The issue is the following: if `istream_iterator` meets a non-`int` value (i.e. a value that cannot be implicitly converted to `int`), it will return `end`. The file may not be read up to the end in this case. Neither an exception is thrown nor an error code given. How can I handle this situation? (This situation can be clearly seen in my answer below). – nickolay Feb 19 '12 at 20:04
  • @DietmarKühl The code with `istreambuf_iterator` cannot read `int`s. typedef unsigned int values_type; const int BUFFER_SIZE(4); vector numbersBuffer(BUFFER_SIZE); basic_ifstream f("C:\\numbers.txt"); istreambuf_iterator it(f), end; for (vector::size_type count(0) ; it != end && count != BUFFER_SIZE; ++it, ++count) { numbersBuffer.push_back(*it); } // This code doesn't read ints. Moreover `vector::push_back()` adds elements to the end of the vector, after the space allocated by `vector::resize()`. Please correct. – nickolay Feb 19 '12 at 20:23
  • @DietmarKühl To summarize the problems we have: 1) how to recognize bad values? e.g. for `int`: 10 12 12345678910 13 15 - 12345678910 is a 'bad' value, because reading by means of `istream_iterator` or `istreambuf_iterator` will stop on this value, leaving 13 and 15 unread. No exception or error code is provided (as far as I know). 2) your first code example needs correction. THANKS – nickolay Feb 19 '12 at 20:34
  • First, to get this out of the way: `std::istreambuf_iterator<cT>` is for reading individual characters from a stream with character type `cT`. You can't choose the character type because it is defined by the respective stream type (e.g. `char` for `std::ifstream` or `wchar_t` for `std::wifstream`). – Dietmar Kühl Feb 19 '12 at 22:52
  • The `std::istream_iterator` just tries to read a value of type `T`, effectively doing something like `T value; in >> value`. If this fails, `std::ios_base::failbit` will be set for the stream, i.e. you can test the status of the stream used by the iterator (e.g. `if (!in) { ... }`). When the entire file has been read, the failure may simply be because the end of the file was reached; in that case `std::ios_base::eofbit` is set. This can be tested with `if (in.eof()) { ... }`, i.e. you have a genuine read error if `std::ios_base::failbit` is set but `std::ios_base::eofbit` is not. – Dietmar Kühl Feb 19 '12 at 22:58
  • @DietmarKühl re `std::istreambuf_iterator` - OK. Do you propose reading chars from the stream and then converting them to integers? – nickolay Feb 20 '12 at 06:42
  • @DietmarKühl re `failbit` - thanks for clarifying, I didn't know that! The last question: how do I continue reading from the item after the failure? Please advise! – nickolay Feb 20 '12 at 06:44
  • @DaddyM: re `std::istreambuf_iterator`: no. I suggest leaving this one alone unless you actually need to interact with the `std::streambuf` (which you don't in this case). re error recovery: use `clear()` to get the stream back into a usable state and then skip the bad data. How this looks depends on your needs. You might want to use `ignore()` to ignore the remainder of the line (see the sketch after this comment thread). – Dietmar Kühl Feb 20 '12 at 11:56
  • @DietmarKühl I still have one question regarding your first code block with `vector` and `streambuf`. This code will not compile if `T` is `int`, because `in` is something like `std::ifstream`, i.e. a `basic_ifstream<char>`, not a `basic_ifstream<T>`. Please clarify. Thanks. – nickolay Feb 20 '12 at 14:47
  • @DaddyM: I don't know where you got the `std::istreambuf_iterator` from: my code certainly doesn't use it! I would know that it doesn't work... My code uses a `std::istream_iterator`. Note the intentional absence of `buf` in the latter type. Actually, I'm pretty sure I tested the code above. Did you try to copy&paste it? (I hope nobody catches me advising the use of copy&paste...) – Dietmar Kühl Feb 20 '12 at 22:38
  • @DietmarKühl Dietmar, I mean that this code would not compile: `typedef unsigned int T; int const size(10); std::fstream f("C:\1.txt"); std::vector buffer; buffer.reserve(size); std::istreambuf_iterator it(f), end; for (std::vector::size_type count(0), capacity(size); it != end && count != capacity; ++it, ++count) { buffer.push_back(*it); }` // What shall I use instead of your `in`? – nickolay Feb 21 '12 at 07:19
  • @DaddyM: You **don't** want to use `std::istreambuf_iterator`: this class is used to extract characters from a stream. You **do** want to use `std::istream_iterator`: this class is used to parse character sequences into `T` objects. I already was very clear on this! – Dietmar Kühl Feb 21 '12 at 10:58
  • @DietmarKühl I've just copied your first code and fleshed it out a bit (e.g. changed `in` to a real stream). – nickolay Feb 21 '12 at 14:05
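
To illustrate the error handling described in these comments, here is a minimal sketch (not part of the original answer) that distinguishes a genuine parse failure from end-of-file and recovers with clear() and ignore(). It assumes one number per line, as in the question's example; the file name "numbers.txt" is made up for illustration.

#include <fstream>
#include <iostream>
#include <limits>
#include <vector>

int main() {
    std::ifstream in("numbers.txt"); // hypothetical input file, one integer per line
    std::vector<unsigned int> values;
    unsigned int value;

    while (in) {
        if (in >> value) {
            values.push_back(value);
        } else if (!in.eof()) {
            // failbit is set but eofbit is not: a genuine parse error,
            // e.g. a token that does not fit into unsigned int
            in.clear();                                                    // make the stream usable again
            in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');  // skip the rest of the line
            std::cerr << "skipped a bad line" << std::endl;
        }
    }
    std::cout << "read " << values.size() << " values" << std::endl;
}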

Here is a possible way to solve my problem:

// 2. value by value reading will be too slow on large data so buffer data
typedef unsigned int values_type;
const int BUFFER_SIZE(4);
vector<values_type> numbersBuffer;
numbersBuffer.reserve(BUFFER_SIZE);
istream_iterator<values_type> begin(*ptrToStream), end;
while (begin != end) {
  numbersBuffer.clear();                                    // reuse the buffer for each chunk
  copy_n(begin, BUFFER_SIZE, back_inserter(numbersBuffer)); // back_inserter appends the values actually copied
  for_each(numbersBuffer.begin(), numbersBuffer.end(), [](values_type const &val){ std::cout << val << std::endl; });
  ++begin;
}

But it has one drawback. If the input file contains the following:

8785
245245454545
7767

then 8785 will be read, but 245245454545 and 7767 will not, because 245245454545 cannot be converted to unsigned int. The error is silent. :(
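
As suggested in the comments under the accepted answer, the failure can at least be detected by inspecting the stream state after the loop: failbit without eofbit means the last extraction failed on a bad value rather than on reaching the end of the file. A small sketch, continuing the snippet above and relying on the ptrToStream stream from the question:

// After the reading loop: distinguish "stopped because of EOF" from
// "stopped because a value could not be parsed as values_type".
if (ptrToStream->fail() && !ptrToStream->eof()) {
  std::cerr << "stopped on a value that is not a valid values_type" << std::endl;
}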

nickolay
  • Why would this be a better solution than using `push_back()` with a loop or, equivalently, using `std::copy_n(it, n, std::back_inserter(buffer))`? (Interestingly, I hadn't realized that `std::copy_n()` was added to the algorithms library, otherwise I wouldn't have used a loop.) First setting up the buffer unnecessarily calls the default constructors for values which are about to be overridden. The only thing you might want to do is to use `reserve()` on the container to make sure it doesn't need to be resized. – Dietmar Kühl Feb 19 '12 at 23:09
  • @DietmarKühl Probably, I've misunderstood you. But I've never said that this solution is better than others. – nickolay Feb 20 '12 at 06:45
  • @DietmarKühl Thank you for the clarification! QUOTE: First setting up the buffer unnecessarily calls the default constructors for values which are about to be overridden. The only thing you might want to do is to use reserve() on the container to make sure it doesn't need to be resized. – nickolay Feb 20 '12 at 06:51