
I've written some test code comparing std::thread and std::async.

#include <iostream>
#include <mutex>
#include <fstream>
#include <string>
#include <memory>
#include <thread>
#include <future>
#include <functional>
#include <cassert>
#include <boost/noncopyable.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/filesystem.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/asio.hpp>

namespace fs = boost::filesystem;
namespace pt = boost::posix_time;
namespace as = boost::asio;
class Log : private boost::noncopyable
{
public:
    void LogPath(const fs::path& filePath) {
        boost::system::error_code ec;
        if(fs::exists(filePath, ec)) {
            fs::remove(filePath);
        }
        this->ofStreamPtr_.reset(new fs::ofstream(filePath));
    };

    void WriteLog(std::size_t i) {
        assert(*this->ofStreamPtr_);
        std::lock_guard<std::mutex> lock(this->logMutex_);
        *this->ofStreamPtr_ << "Hello, World! " << i << "\n";
    };

private:
    std::mutex logMutex_;
    std::unique_ptr<fs::ofstream> ofStreamPtr_;
};

int main(int argc, char *argv[]) {
    if(argc != 2) {
        std::cout << "Wrong argument" << std::endl;
        exit(1);
    }
    std::size_t iter_count = boost::lexical_cast<std::size_t>(argv[1]);

    Log log;
    log.LogPath("log.txt");

    std::function<void(std::size_t)> func = std::bind(&Log::WriteLog, &log, std::placeholders::_1);

    auto start_time = pt::microsec_clock::local_time();
    ////// Version 1: use std::thread //////
//    {
//        std::vector<std::shared_ptr<std::thread> > threadList;
//        threadList.reserve(iter_count);
//        for(std::size_t i = 0; i < iter_count; i++) {
//            threadList.push_back(
//                std::make_shared<std::thread>(func, i));
//        }
//
//        for(auto it: threadList) {
//            it->join();
//        }
//    }

//    pt::time_duration duration = pt::microsec_clock::local_time() - start_time;
//    std::cout << "Version 1: " << duration << std::endl;

    ////// Version 2: use std::async //////
    start_time = pt::microsec_clock::local_time();
    {
        for(std::size_t i = 0; i < iter_count; i++) {
            auto result = std::async(func, i);
        }
    }

    pt::time_duration duration = pt::microsec_clock::local_time() - start_time;
    std::cout << "Version 2: " << duration << std::endl;

    ////// Version 3: use boost::asio::io_service //////
//    start_time = pt::microsec_clock::local_time();
//    {
//        as::io_service ioService;
//        as::io_service::strand strand{ioService};
//        {
//            for(std::size_t i = 0; i < iter_count; i++) {
//                strand.post(std::bind(func, i));
//            }
//        }
//        ioService.run();
//    }

//    duration = pt::microsec_clock::local_time() - start_time;
//    std::cout << "Version 3: " << duration << std::endl;


}

On a 4-core CentOS 7 box (gcc 4.8.5), Version 1 (using std::thread) is about 100x slower than the other implementations.

Iteration   Version 1   Version 2   Version 3
100         0.0034s     0.000051s   0.000066s
1000        0.038s      0.00029s    0.00058s
10000       0.41s       0.0042s     0.0059s
100000      (throws)    0.026s      0.061s

Why is the threaded version so slow? I thought each thread wouldn't take long to complete the Log::WriteLog function.

Byoungchan Lee
  • In my opinion you are firing up too many threads (more than cpu cores) and because of they all are competing for cpu time and context switching, it's slow. In case of async, runtime is managing and executing your code efficiently on just enough threads and yielding processor time where needed. – Saleem Mar 21 '16 at 04:28
  • making thread is _very_ expansive. anything more than number of cores will decrease performance (ignoring threads that blocked by locks/IO). that is why thread-pool is recommended. – Bryan Chen Mar 21 '16 at 04:30
  • That your code fails with 100000 iterations is a big enough hint. A thread is an expensive operating system object and you pay for the cost of creating them and tearing them down again. If the amount of work done by the thread is this small then you definitely see the overhead. An std::async implementation can amortize that cost, using a threadpool is a standard technique. A rough guideline is that a thread should run for a minimum of 100 microseconds, an async function ought not take more than a second. – Hans Passant Mar 21 '16 at 07:26

1 Answer


The function may never be called. You are not passing an std::launch policy in Version 2, so you are relying on the default behavior of std::async (emphasis mine):

Behaves the same as async(std::launch::async | std::launch::deferred, f, args...). In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.

Try re-running your benchmark with this minor change:

auto result = std::async(std::launch::async, func, i);

Alternatively, you could call result.wait() on each std::future in a second loop, similar to how you call join() on all of the threads in Version 1. This forces evaluation of the std::future.

Note that there is a major, unrelated problem with this benchmark. func immediately acquires a lock that it holds for the full duration of the function call, which makes parallelism impossible. There is no advantage to using threads here; I suspect that it will be significantly slower (due to thread creation and locking overhead) than a serial implementation.

Michael Koval
  • Yes, you're right. Without specifying a launch policy, that code never runs... I changed my code to use std::launch::async, and it gives effectively the same performance as Version 1 (0.29s for 10000 iterations). Note: there weren't any performance gains from removing the ```std::lock_guard``` – Byoungchan Lee Mar 21 '16 at 04:45
  • I suggest benchmarking the parallel code against a serial implementation as a sanity check. You definitely should not remove the `std::lock_guard` because `std::fstream` is not thread safe. Also, if this answered your original question, please consider accepting the answer. – Michael Koval Mar 21 '16 at 15:43