
There are many approaches to reading a file into a string. Two common ones are using `ifstream::read` to read directly into the string, and using `std::istreambuf_iterator` together with `std::copy_n`:

Using `ifstream::read`:

std::ifstream in {"./filename.txt"};
std::string contents;
in.seekg(0, in.end);
contents.resize(in.tellg());
in.seekg(0, in.beg);
in.read(&contents[0], contents.size());

Using `std::copy_n`:

std::ifstream in {"./filename.txt"};
std::string contents;
in.seekg(0, in.end);
contents.resize(in.tellg());
in.seekg(0, in.beg);
std::copy_n(std::streambuf_iterator<char>(in), 
            contents.size(), 
            contents.begin();

Many benchmarks show that the first approach is much faster than the second (on my machine, with g++ 4.9, it is about 10 times faster at both -O2 and -O3), and I was wondering what the reason for this difference in performance might be.
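For context, here is a minimal sketch of the kind of timing harness such a comparison might use (the `time_ms` helper, the file name, and the single-run measurement are just assumptions for illustration; actual numbers will vary by machine and standard library):

#include <algorithm>
#include <chrono>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Hypothetical helper: run a callable once and report elapsed milliseconds.
template <typename F>
long long time_ms(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
    std::cout << "read:   " << time_ms([] {
        std::ifstream in {"./filename.txt"};
        std::string contents;
        in.seekg(0, in.end);
        contents.resize(in.tellg());
        in.seekg(0, in.beg);
        in.read(&contents[0], contents.size());
    }) << " ms\n";

    std::cout << "copy_n: " << time_ms([] {
        std::ifstream in {"./filename.txt"};
        std::string contents;
        in.seekg(0, in.end);
        contents.resize(in.tellg());
        in.seekg(0, in.beg);
        std::copy_n(std::istreambuf_iterator<char>(in),
                    contents.size(), contents.begin());
    }) << " ms\n";
}

A serious benchmark would repeat each run and account for the OS file cache, but even a rough setup like this is enough to show an order-of-magnitude gap of the kind described above.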

Veritas
  • I suspect the iterator reads one character at a time. That would explain the difference. – R Sahu Apr 20 '15 at 21:29
  • @RSahu: More specifically, the iterator uses 1+ virtual calls for each byte, whereas `read` is 1-2 virtual calls per buffer. – Mooing Duck Apr 20 '15 at 21:32
  • side comment: you can also construct the string in place like `std::string contents(std::istreambuf_iterator<char>(in), {});`, no need for `copy_n` and getting the size of the file (a short sketch of this appears after these comments). But probably it's not going to make a big difference in terms of speed. – vsoftco Apr 20 '15 at 21:32
  • @RSahu It really shouldn’t though, that would be a terrible implementation. In fact, I thought I’d remembered that the libstdc++ implementation was surprisingly smart with file buffer iterators, even pre-sizing the buffer (which would require determining that the buffer iterator points into a seekable file). I may misremember that though. – Konrad Rudolph Apr 20 '15 at 21:33
  • How big a file are we talking about? It probably depends on that. Also, you may want to be careful reading the whole thing into memory in one big chunk without knowing how big it might be. Depending on what kind of processing you're doing, the program may be more efficient reading in a line at a time. – Dan Korn Apr 20 '15 at 21:34
  • @KonradRudolph I wonder, though, how the iterator can "know" the length of the file? Only in that case can it somehow attempt to read a whole block. But of course that's not proof that it cannot be smarter than I think. – vsoftco Apr 20 '15 at 21:34
  • @vsoftco Essentially, by holding on to a reference to the owning buffer. – Konrad Rudolph Apr 20 '15 at 21:35
  • @KonradRudolph Yes exactly, it was just crossing my mind now... – vsoftco Apr 20 '15 at 21:37
  • @Veritas, what about clang++? Did you try to see how fast it is? I am just curious about the libc++ implementation. – vsoftco Apr 20 '15 at 21:39
  • @vsoftco No, but this guy has: http://insanecoding.blogspot.gr/2011/11/how-to-read-in-file-in-c.html although his approach is a bit different (`std::copy` instead of `std::copy_n`, with `std::back_inserter`). – Veritas Apr 20 '15 at 21:41
  • A faster method is to tell `std::string` an approximate size before you create it. Otherwise your program will feel the pain of the `std::string` resizing. – Thomas Matthews Apr 21 '15 at 00:08
  • @ThomasMatthews they are resized in the examples. – Veritas Apr 21 '15 at 03:36
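For completeness, a small sketch of the two alternatives mentioned in the comments above (the file name is just a placeholder); neither variant pre-sizes the string, so it grows as it is read:

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

int main() {
    // Variant from vsoftco's comment: construct the string directly from a
    // pair of istreambuf_iterators (the default-constructed iterator acts as
    // the end-of-stream sentinel).
    std::ifstream in1 {"./filename.txt"};
    std::string contents1(std::istreambuf_iterator<char>(in1), {});

    // Variant from the linked blog post: std::copy with a back_inserter,
    // which appends character by character and grows the string as needed.
    std::ifstream in2 {"./filename.txt"};
    std::string contents2;
    std::copy(std::istreambuf_iterator<char>(in2),
              std::istreambuf_iterator<char>(),
              std::back_inserter(contents2));
}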

1 Answer


`read` performs a single iostream setup (the part that is common to every iostream operation) and then a single call to the OS, reading directly into the buffer you provided.

The iterator works by repeatedly extracting a single char with `operator>>`. Because of the buffer size, this might mean more OS calls, but more importantly it also means repeatedly setting up and tearing down the iostream sentry, which might mean a mutex lock and usually means a bunch of other stuff. Furthermore, `operator>>` is a formatted operation, whereas `read` is unformatted, which adds setup overhead to every operation.

Edit: Tired eyes saw `istream_iterator` instead of `istreambuf_iterator`. Of course `istreambuf_iterator` does not do formatted input. It calls `sbumpc` or something like that on the streambuf. Still a lot of calls, and it goes through the stream buffer, which is probably smaller than the entire file.
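As a rough illustration (a simplification, not the actual library source), the iterator path comes down to a per-character call into the stream buffer, while `read` hands over the whole destination buffer at once:

#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>

// Sketch of what copying through istreambuf_iterator<char> roughly amounts to:
// one sgetc()/sbumpc() round trip on the streambuf per character.
void iterator_path(std::istream& in, std::string& contents) {
    std::streambuf* sb = in.rdbuf();
    for (std::size_t i = 0; i < contents.size(); ++i) {
        int c = sb->sbumpc();                      // per-character streambuf call
        if (c == std::char_traits<char>::eof())    // stop at end of stream
            break;
        contents[i] = static_cast<char>(c);
    }
}

// Sketch of the read path: one sentry, then a bulk transfer from the streambuf.
void read_path(std::istream& in, std::string& contents) {
    in.read(&contents[0], contents.size());
}

Whether those per-character calls get inlined, and how large the underlying stream buffer is, largely determines how much slower the iterator version ends up being.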

Sebastian Redl
  • “The iterator works by repeatedly extracting a single char with `operator>>`” — any reference for that? In fact, `operator>>` does formatted input, while `istreambuf_iterator` does *unformatted* input, so I think that’s unlikely. (EDIT: According to [cppreference.com](http://en.cppreference.com/w/cpp/iterator/istreambuf_iterator/operator*) this answer is wrong.) – Konrad Rudolph Apr 20 '15 at 21:34
  • That is sort of the definition of what a streambuf_iterator is, Konrad. Each time it is used, it extracts a single char from the corresponding stream buffer. – Peter Apr 20 '15 at 21:39
  • @Peter The point was mainly about formatted vs unformatted input, which makes a huge difference. The next question is whether the compiler can inline all the calls in the call chain of the iterator’s `operator*`, and could then perform loop unrolling. Or whether it’s legal for an implementation of `std::copy{_n}` to dispatch the call to a method which manually replaces the iterator copying with a more efficient method (sort of what `std::copy` does for `IterT=char*`, which is dispatched to `std::memcpy` in modern implementations). – Konrad Rudolph Apr 20 '15 at 21:43
  • Edit: The standard does require that `basic_istream::read` create a sentry object, but not explicitly that it call `sbumpc`/`sgetc` to get the characters. Therefore a single OS call seems plausible. – Veritas Apr 20 '15 at 21:55