
I saw this user's post yesterday, and I thought it was a cool way to output a vector. So I typed up an example and asked myself: how does this compare to a for each loop?

#include <algorithm>
#include <chrono>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

template <typename T>
void printVectorO(std::vector<T> &v)
{
    std::cout << "Ostream_iterator contents: " << std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    std::ostream_iterator<T> ost(std::cout, " ");
    std::copy(begin(v), end(v), ost);
    std::cout << std::endl;

    auto end = std::chrono::high_resolution_clock::now();
    auto time = end - start;
    auto nano = std::chrono::duration_cast<std::chrono::nanoseconds>(time);
    std::cout << "Ostream_iterator computation took: " << nano.count() << " nano seconds"<< std::endl;
    std::cout << std::endl;
}

template <typename T>
void printVectorC(std::vector<T> &v)
{
    std::cout << "For Each Loop contents: " << std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    for (auto && e : v) std::cout << e << " ";
    std::cout << std::endl;

    auto end = std::chrono::high_resolution_clock::now();
    auto time = end - start;
    auto nano = std::chrono::duration_cast<std::chrono::nanoseconds>(time);
    std::cout << "For Each Loop took: " << nano.count() << " nano seconds" << std::endl;
    std::cout << std::endl;
}

I used 3 vectors to test this:

std::vector<double> doubles = { 3.15, 2.17, 2.555, 2.014 };
std::vector<std::string> strings = { "Hi", "how", "are", "you" };
std::vector<int> ints = { 3, 2 , 2 , 2 };

And I get varying results. The for each loop always beats the ostream_iterator when I output the doubles (e.g. 41856 vs. 11207 and 55198 vs. 10308 nanoseconds). Sometimes the ostream_iterator beats the for each loop with strings, and with integers the two stay roughly neck and neck.

Why is this? What is going on behind the scenes of ostream_iterator? When would I use ostream_iterator over a for each loop when it comes to efficiency and speed?

Sailanarmo
  • Recommend adding compiler and optimization level to help establish a frame of reference. – user4581301 May 07 '18 at 23:36
  • Factor out the time of the `endl` from your timing, since your arrays are so short. – jxh May 07 '18 at 23:37
  • Perhaps add a loop around your call to `copy` and a loop around your `for` to repeat the code of interest something like a million times to factor out other things that might affect your timing. – jxh May 07 '18 at 23:39
  • In addition to jxh's comment, perhaps add a loop to generate a few hundred thousand values to put into the vectors so the differences will be more distinct (i.e. check if the difference is always a few nanoseconds vs. if it grows proportionally). – scohe001 May 07 '18 at 23:40
  • I am using `g++ -std=c++14` with the `-Ofast` flag on release. I will add the loops here soon and let you know the results! – Sailanarmo May 07 '18 at 23:44
  • I agree, with just three values any metrics are meaningless. Try a few thousand values. Also try outputting to a `std::ostringstream` rather than `std::cout`. Outputting to `std::cout` may reflect some artifacts regarding output buffer flushing. – Sam Varshavchik May 08 '18 at 00:13
  • @scohe001 my program is currently running with 3 vectors of size 3,000 of type strings, doubles, and integers. Did you mean for me to start my vector of size 1, then iterate to size 3000? Or is the way I am doing it just fine? – Sailanarmo May 08 '18 at 03:59
  • Ahh you said "I used 3 vectors to test this:" and then listed 3 itty bitty vectors. Even then, I would suggest 10,000+ if you can – scohe001 May 08 '18 at 04:15
  • @scohe001 I originally used those 3 itty bitty vectors, but I updated my program and those 3 itty bitty vectors are now 3 vectors of size 3000. But I created them as size 3000, should I iterate the size to make this a better test? Or is it just fine at 3,000? I can also make them size 10,000+ if needed. – Sailanarmo May 08 '18 at 04:44
  • The cost of displaying characters on the console (and scrolling the console) will dwarf the negligible performance difference of the two iterating strategies. The only way a benchmark will make sense is to stream out to a std::ostream whose underlying string already has sufficient capacity to prevent re-allocations. Another alternative would be to create your own streambuf-derived stream buffer and assign it to an ostream. – Richard Hodges May 08 '18 at 06:17
  • @RichardHodges do you have a link or a reference where I can find more information to stream to a std::ostream? – Sailanarmo May 08 '18 at 06:20
  • ostream: http://en.cppreference.com/w/cpp/io/basic_ostream, and more importantly: http://en.cppreference.com/w/cpp/io/basic_streambuf – Richard Hodges May 08 '18 at 06:25
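
As Sam Varshavchik and Richard Hodges suggest above, console rendering and buffer flushing can swamp the difference you are trying to measure. A minimal sketch of the streambuf idea, assuming you are happy to discard the output entirely (the class name null_streambuf is just illustrative):

#include <iostream>
#include <streambuf>

// A stream buffer that discards every character written to it, so the cost
// of rendering and scrolling the terminal never enters the measurement.
class null_streambuf : public std::streambuf
{
protected:
    // Called for each character once the (empty) put area is "full";
    // report success without storing anything.
    int_type overflow(int_type c) override { return traits_type::not_eof(c); }
};

int main()
{
    null_streambuf sink;
    std::streambuf* old = std::cout.rdbuf(&sink); // point std::cout at the sink

    std::cout << "this output is discarded\n";    // run the benchmark here instead

    std::cout.rdbuf(old);                         // restore std::cout afterwards
}

Redirecting the program's output to a file, or writing to a std::ostringstream as suggested above, achieves much the same thing without a custom buffer.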

1 Answer


Beware of micro-benchmarks.

I have several general comments regarding this code:

  1. Pass read-only arguments by const reference, not by plain non-const reference. (This does not affect performance, though.)
  2. Don't use std::endl, since it calls flush(), which ends up taking most of the run time in such a micro-benchmark. For example, printing the doubles took 37010 ns with std::endl and only 4456 ns with '\n'.
  3. A single measurement is inaccurate. To smooth out measurement noise you must run the code many times in a loop. Even this is imperfect; the best thing is to run the two tests interleaved, so that random events that slow down the machine affect both implementations the same way.
  4. Better to redirect the output to a file, otherwise terminal speed will dominate the results.

Here is the corrected benchmark:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

constexpr unsigned ITERATIONS = 1000000;
template <typename T>
void printVectorO(const std::vector<T> &v)
{
    std::cout << "Ostream_iterator contents\n";
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned i=0 ; i < ITERATIONS; ++i) {
        std::ostream_iterator<T> ost(std::cout, " ");
        std::copy(begin(v), end(v), ost);
        std::cout << '\n';
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto time = end - start;
    auto nano = std::chrono::duration_cast<std::chrono::nanoseconds>(time);
    std::cout << "Ostream_iterator computation took: "
              << nano.count() / ITERATIONS << " nano seconds\n\n";
}

template <typename T>
void printVectorC(const std::vector<T> &v)
{
    std::cout << "For Each Loop contents\n";
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned i=0 ; i < ITERATIONS ; ++i) {
        for (auto && e : v) std::cout << e << " ";
        std::cout << '\n';
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto time = end - start;
    auto nano = std::chrono::duration_cast<std::chrono::nanoseconds>(time);
    std::cout << "For Each Loop took: "
              << nano.count() / ITERATIONS << " nano seconds\n\n";
}

And invoking it with:

template <class Container>
void test(const Container & ctr)
{
    printVectorC(ctr);
    printVectorO(ctr);
}


int main()
{
    std::vector<double> doubles = { 3.15, 2.17, 2.555, 2.014 };
    test(doubles);
    std::vector<std::string> strings = { "Hi", "how", "are", "you" };
    test(strings);
    std::vector<int> ints = { 3, 2 , 2 , 2 };
    test(ints);
}

And now, after grepping the output for nano, we have:

For Each Loop took: 2045 nano seconds
Ostream_iterator computation took: 2033 nano seconds
For Each Loop took: 487 nano seconds
Ostream_iterator computation took: 485 nano seconds
For Each Loop took: 503 nano seconds
Ostream_iterator computation took: 499 nano seconds

There is barely any difference. Actually, in this specific run the ostream_iterator version seems to be slightly faster, but running it again gives slightly different results.
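
That result makes sense once you look at what std::ostream_iterator does behind the scenes: assigning to it simply writes the value with operator<< followed by the delimiter, and its dereference and increment operators are no-ops that exist only so algorithms can treat it as an output iterator. So std::copy with an ostream_iterator ends up issuing essentially the same stream calls as the range-based for loop, and the choice between the two is about style and composability with algorithms rather than speed. A simplified sketch of the idea (illustrative only, not the actual library implementation):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <vector>

// Simplified stand-in for std::ostream_iterator: assignment prints the value
// plus the delimiter; dereference and increment do nothing.
template <typename T>
class toy_ostream_iterator
{
    std::ostream* out;
    const char*   delim;
public:
    // Member typedefs that mark this type as an output iterator for algorithms.
    using iterator_category = std::output_iterator_tag;
    using value_type        = void;
    using difference_type   = std::ptrdiff_t;
    using pointer           = void;
    using reference         = void;

    toy_ostream_iterator(std::ostream& os, const char* d) : out(&os), delim(d) {}

    // The only operation that does real work: the same operator<< call a
    // range-based for loop would make, plus the delimiter.
    toy_ostream_iterator& operator=(const T& value)
    {
        *out << value << delim;
        return *this;
    }

    toy_ostream_iterator& operator*()     { return *this; }
    toy_ostream_iterator& operator++()    { return *this; }
    toy_ostream_iterator& operator++(int) { return *this; }
};

int main()
{
    std::vector<int> ints = { 3, 2, 2, 2 };
    std::copy(ints.begin(), ints.end(), toy_ostream_iterator<int>(std::cout, " "));
    std::cout << '\n';
}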

Michael Veksler