Using std::async slower than non-async method to populate a vector

Question

I am experimenting with std::async to populate a vector. The idea behind it is to use multi-threading to save time. However, running some benchmark tests I find that my non-async method is faster!

#include <algorithm>
#include <vector>
#include <future>

std::vector<int> Generate(int i)
{
    std::vector<int> v;
    for (int j = i; j < i + 10; ++j)
    {
        v.push_back(j);
    }
    return v;
}

Async:

std::vector<std::future<std::vector<int>>> futures;
for (int i = 0; i < 200; i+=10)
{
  futures.push_back(std::async(
    [](int i) { return Generate(i); }, i));
}

std::vector<int> res;
for (auto &&f : futures)
{
  auto vec = f.get();
  res.insert(std::end(res), std::begin(vec), std::end(vec));
}

Non-async:

std::vector<int> res;
for (int i = 0; i < 200; i+=10)
{
   auto vec = Generate(i);
   res.insert(std::end(res), std::begin(vec), std::end(vec));
}

My benchmark test shows that the async method is 71 times slower than non-async. What am I doing wrong?

Have you measured how long your context switch (caused by mutex lock-guard) takes? — 2785528, Aug 28 '19 at 00:30

score 4 · Answer 1 · answered Aug 27 '19 at 21:55

std::async has two modes of operation:

std::launch::async
std::launch::deferred

In this case, you've called std::async without specifying either one, which means it's allowed to choose either one. std::launch::deferred basically means do the work on the calling thread. So std::async returns a future, and with std::launch::deferred, the action you've requested won't be carried out until you call .get on that future. It can be kind of handy under a few circumstances, but it's probably not what you want here.
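
To see the difference between the two policies, a minimal sketch along these lines may help. The helper name `ran_on_caller_thread` is made up for illustration; it only compares thread ids and says nothing about which policy a given implementation picks when none is specified.

```cpp
#include <future>
#include <thread>

// Hypothetical helper: returns true if a task launched with `policy`
// executed on the same thread that called this function.
bool ran_on_caller_thread(std::launch policy) {
    const std::thread::id caller = std::this_thread::get_id();
    // The task simply reports the id of the thread it runs on.
    std::future<std::thread::id> f =
        std::async(policy, [] { return std::this_thread::get_id(); });
    // With std::launch::deferred, the task only runs here, inside get().
    return f.get() == caller;
}
```

With `std::launch::deferred` this should return true, because the work happens inside `get()` on the caller; with `std::launch::async` it should return false.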

Even if you specify std::launch::async, you need to realize that this starts up a new thread of execution to carry out the action you've requested. It then has to create a future, and use some sort of signalling from the thread to the future to let you know when the computation you've requested is done.

All of that adds a fair amount of overhead--anywhere from microseconds to milliseconds or so, depending on the OS, CPU, etc.

So, for asynchronous execution to make sense, the "stuff" you do asynchronously typically needs to take tens of milliseconds at the very least (and hundreds of milliseconds might be a more sensible lower threshold). I wouldn't get too wrapped up in the exact cutoff, but it needs to be something that takes a while.
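
If you want a rough figure for that overhead on your own machine, something like the following sketch could be used. The function name `average_launch_cost` is an assumption for illustration, and the number it returns will vary a lot with OS, CPU, and standard library.

```cpp
#include <chrono>
#include <future>

// Hypothetical helper: average wall-clock cost of launching one empty task
// with std::launch::async and waiting for it. Assumes launches > 0.
std::chrono::nanoseconds average_launch_cost(int launches) {
    using clock = std::chrono::steady_clock;
    const auto begin = clock::now();
    for (int i = 0; i < launches; ++i) {
        // The task does nothing, so the time measured is almost all overhead.
        std::async(std::launch::async, [] {}).get();
    }
    const auto elapsed = clock::now() - begin;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed) / launches;
}
```

Comparing this figure against the time one call to `Generate` takes gives a feel for how much work a task needs before launching it asynchronously can pay off.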

So, filling an array asynchronously probably only makes sense if the array is quite a lot larger than you're dealing with here.

For filling memory, you'll quickly run into another problem though: most CPUs are so much faster than main memory that, if all you're doing is writing to memory, there's a pretty good chance a single thread will already saturate the path to memory. So even at best, doing the job asynchronously will gain only a little, and it may still quite easily cause a slowdown.

The ideal case for asynchronous operation would be something like one thread that's heavily memory bound, but another that (for example) reads a little bit of data, and does a lot of computation on that small amount of data. In this case, the computation thread will mostly operate on its data in the cache, so it won't get in the way of the memory thread doing its thing.

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

Several factors are causing the multithreaded code to perform (much) slower than the single-threaded code.

Your array sizes are too small

Multithreading often has negligible-to-no effect on datasets that are particularly small. In both versions of your code, you're generating 200 integers in total, and each logical thread (which, because std::async is often implemented in terms of thread pools, might not be the same as a software thread) is generating only 10 of them. The cost of spinning up a thread for every 10 integers far outweighs the benefit of generating those integers in parallel.

You might see a performance gain if each thread were instead responsible for, say, 10,000 integers, but you'll probably then run into a different issue:

All your code is bottlenecked by an inherently serial process

Both versions of the code copy the generated integers into a host vector. It would be one thing if the act of generating those integers was itself a time consuming process, but in your case, it's likely just a matter of a small, fast bit of assembly generating each integer.

So the act of copying each integer into the final vector is probably not inherently faster than generating each integer, meaning a sizable chunk of the "work" being done is completely serial, defeating the whole purpose of multithreading your code.
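
One way to put a number on this is Amdahl's law. The answer does not name it, so treating it as the relevant model here is an assumption. A small sketch:

```cpp
// Amdahl's law: upper bound on the speedup when only `parallel_fraction`
// of the total work can be spread across `workers` threads.
// The remaining (1 - parallel_fraction) stays serial no matter what.
double amdahl_speedup(double parallel_fraction, int workers) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

If, say, half of the time is spent in the serial copy, `amdahl_speedup(0.5, 2)` is about 1.33, and no number of threads can push the speedup past 2. That bound ignores thread start-up cost, which only makes things worse.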

Fixing the code

Compilers are very good at their jobs, so in trying to revise your code, I was only barely able to get multithreaded code that was faster than the serial code. Multiple executions had varying results, so my general assessment is that this kind of code is bad at being multithreaded.

But here's what I came up with:

#include <algorithm>
#include <chrono>
#include <future>
#include <iomanip>
#include <iostream>
#include <thread> // for std::thread::hardware_concurrency
#include <vector>

//#1: Constants
constexpr int BLOCK_SIZE = 500000;
constexpr int NUM_OF_BLOCKS = 20;

std::vector<int> Generate(int i) {
    std::vector<int> v;
    for (int j = i; j < i + BLOCK_SIZE; ++j) {
        v.push_back(j);
    }
    return v;
}

void asynchronous_attempt() {
    std::vector<std::future<void>> futures;
    //#2: Preallocated Vector
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE)
    {
      futures.push_back(std::async(
        [it](int i) { 
            auto vec = Generate(i); 
            //#3 Copying done multithreaded
            std::copy(vec.begin(), vec.end(), it + i);
        }, i));
    }
    
    for (auto &&f : futures) {
        f.get();
    }
}

void serial_attempt() {
    //#4 Changes here to show fair comparison
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE) {
        auto vec = Generate(i);
        it = std::copy(vec.begin(), vec.end(), it);
    }
}

int main() {
    using clock = std::chrono::steady_clock;

    std::cout << "Theoretical # of Threads: " << std::thread::hardware_concurrency() << std::endl;
    auto begin = clock::now();
    asynchronous_attempt();
    auto end = clock::now();
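    // steady_clock's tick period is implementation-defined, so convert explicitly before printing "ns"
    std::cout << "Duration of Multithreaded Attempt: " << std::setw(10) << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;
    begin = clock::now();
    serial_attempt();
    end = clock::now();
    std::cout << "Duration of Serial Attempt:        " << std::setw(10) << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;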
}

This resulted in the following output:

Theoretical # of Threads: 2
Duration of Multithreaded Attempt:  361149213ns
Duration of Serial Attempt:         364785676ns

Given that this was run on an online compiler, I'm willing to bet the multithreaded code might win out on a dedicated machine, but I think this at least demonstrates that the two methods are now on par.

Below are the changes I made, keyed to the #N comments in the code:

#1: We've dramatically increased the number of integers being generated, to force the threads to do actual meaningful work instead of getting bogged down in OS-level housekeeping.
#2: The vector has its size preallocated. No more frequent resizing.
#3: Now that the space has been preallocated, we can multithread the copying instead of doing it serially later.
#4: We have to change the serial code so it also preallocates and copies, so that it's a fair comparison.

Now, we've ensured that all the code is indeed running in parallel, and while it's not amounting to a substantial improvement over the serial code, it's at least no longer exhibiting the degenerate performance losses we were seeing before.
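
A possible further variation, not part of the measurements above, is to skip the temporary vector entirely and let each task write straight into its own slice of the preallocated result. The function name `fill_in_place` and the use of `std::iota` are choices made for this sketch only.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical variant: each task fills its own disjoint slice of `res`,
// so there is no per-block temporary and no copy step.
std::vector<int> fill_in_place(int num_blocks, int block_size) {
    std::vector<int> res(static_cast<std::size_t>(num_blocks) * block_size);
    std::vector<std::future<void>> futures;
    for (int b = 0; b < num_blocks; ++b) {
        int* first = res.data() + static_cast<std::size_t>(b) * block_size;
        const int start = b * block_size;
        futures.push_back(std::async(std::launch::async, [first, block_size, start] {
            // Writes start, start + 1, ... into this task's slice only.
            std::iota(first, first + block_size, start);
        }));
    }
    // Wait for every task before handing the vector back.
    for (auto& f : futures) {
        f.get();
    }
    return res;
}
```

The slices do not overlap, so the tasks never touch the same element. Whether this beats the serial loop still depends on memory bandwidth, as noted in the first answer.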

score 2 · Answer 3 · answered Aug 27 '19 at 21:30

First of all, you are not forcing std::async to work asynchronously (you would need to specify the std::launch::async policy to do so). Second of all, it's overkill to asynchronously create a std::vector of 10 ints. It's just not worth it. Remember: using more threads does not mean that you will see a performance benefit! Creating a thread (or even using a thread pool) introduces some overhead, which, in this case, seems to dwarf the benefits of running tasks asynchronously.

Thanks @NathanOliver ;>

Fureeish

I tried with std::launch::async but it made no real difference. I really want to call a function that returns a vector of results from a database query instead. I thought I would experiment with a vector of 10 ints first. – jignatius Aug 27 '19 at 21:38
Concurrency is something to be tested on real-world data, or mocks that mimic it pretty well. While the multithreaded approach won't work for such small tasks, they may introduce significant performance benefits when paired with heavier tasks. Just benchmark your final program. Concurrency is not something that scales well, or even fairly linearly. – Fureeish Aug 27 '19 at 21:44
@jacobi creating a thread to make a vector of 10 ints is an incredibly slow way to make them. If you create one thread that generates 1000 vectors of 10 ints, then it might be worth creating a thread and running in parallel. Furthermore, in your case, just copying the data takes about the same time as generating it. The non-async version was probably also optimized better. – ALX23z Aug 27 '19 at 21:52
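
For the database case mentioned in the comments, where each call mostly waits rather than computes, a rough sketch could look like this. `fake_query` is a stand-in that just sleeps, and its 20 ms delay and returned rows are invented for illustration; no real database API is shown.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for a blocking database query:
// waits 20 ms, then returns three made-up rows.
std::vector<int> fake_query(int id) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return {id, id + 1, id + 2};
}

// Launches `count` queries at once and concatenates the results in launch order.
std::vector<int> run_queries(int count) {
    std::vector<std::future<std::vector<int>>> futures;
    for (int q = 0; q < count; ++q) {
        futures.push_back(std::async(std::launch::async, fake_query, q * 10));
    }
    std::vector<int> res;
    for (auto& f : futures) {
        std::vector<int> rows = f.get();
        res.insert(res.end(), rows.begin(), rows.end());
    }
    return res;
}
```

Here each task spends its time waiting, so the waits can overlap and the thread start-up cost is small by comparison. As the comments say, only a benchmark against the real queries will show whether it helps.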
