
I'm currently playing around with std::async, since I read that it performs better than std::thread. I wrote a simple program with a function ("waitsome") that takes roughly 500 ms to compute on my machine. If I run it through std::async, however (computing it 16 times instead of once), it takes a whopping 50 s.

I already found out that the destructor of a future returned by std::async may block, so you should make sure the future is assigned or moved if it lives in a limited scope; hence I std::move each future into the vector holding the futures. Other than that I have no real idea. I used the "Very Sleepy" profiler to check which function wastes all the time and got this image: [profiler screenshot: "What the profiler tells me is happening"]. ehgfkua(...) is the name of the executable.
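
To show what I mean by the blocking destructor, here is a minimal sketch (separate from my actual program, using a hypothetical work() placeholder): letting each future die inside the loop serializes the calls, while moving them into a vector lets them overlap.

#include <future>
#include <vector>

void work() { /* hypothetical placeholder for a long-running job */ }

void serialized()
{
  for (int i = 0; i < 16; i++)
  {
    auto f = std::async(std::launch::async, &work);
  } // f's destructor runs here and blocks until work() has finished -> no overlap
}

void concurrent()
{
  std::vector<std::future<void>> futs;
  for (int i = 0; i < 16; i++)
    futs.push_back(std::async(std::launch::async, &work)); // futures stay alive -> calls overlap
  for (auto& f : futs)
    f.wait(); // wait for all of them at the end
}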

Please find the source code below. The platform is Windows and the compiler is VS2022 (invoked from VS Code). Do I have a general misconception of std::async? Essentially I want to create worker threads and get the results back as std::futures.

#include <iostream>
#include <future>
#include <thread>
#include <chrono>
#include <vector>

constexpr unsigned int highestSequence = 1000000;

void waitsome();

int main(int argc, char** argv)
{
  auto startTime = std::chrono::high_resolution_clock::now();

  // parallel portion
  std::vector<std::future<void>> futVec;
  for(unsigned int i = 0; i < 16; i++)
  {
    futVec.push_back(std::move(std::async(std::launch::async, &waitsome)));
  }

  for(unsigned int i = 0; i < futVec.size(); i++)
  {
    futVec.at(i).wait();
  }
  
  auto stopTime = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double, std::milli> computationTime = stopTime - startTime;
  std::cout << "async computation took " << computationTime.count() << " ms" << std::endl;
  
  // sequential portion
  startTime = std::chrono::high_resolution_clock::now();
  waitsome();
  stopTime = std::chrono::high_resolution_clock::now();
  
  std::chrono::duration<double, std::milli> computationTimeSingle = stopTime - startTime;
  std::cout << "single computation took " << computationTimeSingle.count() << " ms" << std::endl;
}

void waitsome()
{
  unsigned int a = 1;
  for(unsigned int i = 0; i < 1000000; i++)
  {
    std::vector<unsigned int> myvec;
    a += 2.0;
    myvec.push_back(a);
  }
  return;
}

The output of 3 consecutive runs looks like this:

async computation took 52775.2 ms
single computation took 498.063 ms
async computation took 52890.9 ms
single computation took 502.281 ms
async computation took 52680.8 ms
single computation took 516.881 ms

Your test is invalid. If you didn't compile with optimizations turned on, you can't surmise this is production behavior. If you did compile with optimizations turned on, then the compiler would have optimized `waitsome` to be a no-op. https://godbolt.org/z/Wdrzxav4W – selbie Aug 06 '23 at 15:39
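
For illustration, a minimal sketch (my own example, not taken from the comment) of one way to keep an optimizing build from folding the work away: reading a volatile value on every iteration forces the loop to actually run.

#include <iostream>

volatile unsigned int step = 2;              // volatile "guard" value (assumed name)

unsigned int waitsome_guarded()
{
  unsigned int a = 1;
  for (unsigned int i = 0; i < 1000000; i++)
    a += step;                               // each iteration must really read step
  return a;
}

int main()
{
  std::cout << waitsome_guarded() << '\n';   // use the result so it stays observable
}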
Also your parallel test does 16 times the work of the sequential one, so of course it is not going to be faster. It is _much_ slower probably either because your system doesn't have 16 free CPU cores to run all of these threads in parallel or because your test code basically consists only of repeated memory allocation and deallocation, which need (to varying degrees) synchronization across threads and will therefore not generally be performant in parallel. – user17732522 Aug 06 '23 at 15:57
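
A sketch of a like-for-like benchmark along those lines (my own rearrangement of the test; it assumes the waitsome() from the question is linked in): both sides perform the same 16 calls, so the comparison is sequential versus concurrent rather than 1 call versus 16.

#include <chrono>
#include <future>
#include <iostream>
#include <vector>

void waitsome();   // the worker from the question, defined elsewhere

int main()
{
  using clock = std::chrono::high_resolution_clock;

  auto t0 = clock::now();
  for (int i = 0; i < 16; i++)
    waitsome();                                              // 16 sequential calls
  auto t1 = clock::now();

  std::vector<std::future<void>> futs;
  for (int i = 0; i < 16; i++)
    futs.push_back(std::async(std::launch::async, &waitsome));
  for (auto& f : futs)
    f.get();                                                 // the same 16 calls, run concurrently
  auto t2 = clock::now();

  std::chrono::duration<double, std::milli> seq = t1 - t0;
  std::chrono::duration<double, std::milli> par = t2 - t1;
  std::cout << "sequential x16: " << seq.count() << " ms, async x16: " << par.count() << " ms\n";
}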
"_std::async since I read that it performs better than std::thread_": No idea where you got this from, but it isn't true, at least for your use case. `std::async` is mostly a convenience wrapper around `std::thread` to return values. Technically `std::async` might be implemented as a thread pool which may have benefits, but it would affect the overhead of starting a threaded function only, not the time the function actually runs. – user17732522 Aug 06 '23 at 15:59
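
For illustration, a small sketch (my own comparison, not part of the comment) of that relationship: the main convenience std::async adds over std::thread is a future for the result, which can be reproduced by hand with std::packaged_task.

#include <future>
#include <thread>

int compute() { return 42; }                 // trivial stand-in for real work

int main()
{
  // One call to std::async hands back a future for the return value.
  std::future<int> viaAsync = std::async(std::launch::async, compute);

  // Roughly the same plumbing spelled out with std::thread + std::packaged_task.
  std::packaged_task<int()> task(compute);
  std::future<int> viaThread = task.get_future();
  std::thread worker(std::move(task));

  int a = viaAsync.get();
  int b = viaThread.get();
  worker.join();
  return (a == b) ? 0 : 1;
}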
Your code runs in 10s of milliseconds on my PC using VS 2022 in release mode. Don't measure the performance of debug builds, especially for template-heavy code like `std::async` and `std::vector`. – Alan Birtles Aug 06 '23 at 16:02
The trick to getting good performance with threads is to eliminate as much as possible anything that blocks between threads, for example vectors being resized automatically instead of initializing the vector's capacity to what's needed: `myvec.reserve(1000000)` at the start of waitsome() before the loop. Huge improvement. – doug Aug 06 '23 at 16:55
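
A sketch of that suggestion (my reading of it: the vector is hoisted out of the loop so a single reserve() covers all one million push_backs):

#include <vector>

void waitsome_reserved()
{
  unsigned int a = 1;
  std::vector<unsigned int> myvec;
  myvec.reserve(1000000);                    // one allocation up front, no reallocation in the loop
  for (unsigned int i = 0; i < 1000000; i++)
  {
    a += 2;
    myvec.push_back(a);
  }
}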
  • Unrelated. `std::move()` in your code does nothing and is useless. – Evg Aug 06 '23 at 17:35
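
For illustration, a tiny sketch (not part of the comment): the result of std::async is already an rvalue, so push_back moves from it either way and the two lines below are equivalent.

#include <future>
#include <vector>

void waitsome();   // the worker from the question

void fill(std::vector<std::future<void>>& futVec)
{
  futVec.push_back(std::async(std::launch::async, &waitsome));             // moves already
  futVec.push_back(std::move(std::async(std::launch::async, &waitsome)));  // identical effect
}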
@selbie Thanks, I did not believe that optimizations would actually reduce the computation time from 50 s to a few hundred ms... but they did. Mentally noted to never test for speed in a debug build again :) – MAPster Aug 06 '23 at 19:25
