Several factors are causing the multithreaded code to run (much) slower than the single-threaded code.
Your array sizes are too small
Multithreading often has negligible-to-no effect on datasets that are particularly small. In both versions of your code, you're generating 2000 integers, and each Logical Thread (which, because std::async is often implemented in terms of thread pools, might not be the same as a Software Thread) is only generating 10 integers. The cost of spinning up a thread for every 10 integers far outweighs the benefit of generating those integers in parallel.
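To see the effect of task granularity in isolation, here is a minimal sketch. The helpers `make_chunk` and `generate_chunked` are hypothetical names, not taken from your code, and the explicit `std::launch::async` policy is my assumption so that every task really does get its own thread. It produces the same integers either way; only the number of tasks changes.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: produce the integers [first, first + count).
std::vector<int> make_chunk(int first, int count) {
    std::vector<int> v;
    v.reserve(static_cast<std::size_t>(count));
    for (int j = first; j < first + count; ++j) {
        v.push_back(j);
    }
    return v;
}

// Generate 0..total-1 using one std::async task per `chunk` integers,
// then merge the pieces serially on the calling thread.
std::vector<int> generate_chunked(int total, int chunk) {
    assert(chunk > 0 && total >= 0 && total % chunk == 0);
    std::vector<std::future<std::vector<int>>> tasks;
    for (int first = 0; first < total; first += chunk) {
        // Assumption: force a new thread per task instead of the default policy.
        tasks.push_back(std::async(std::launch::async, make_chunk, first, chunk));
    }
    std::vector<int> result;
    result.reserve(static_cast<std::size_t>(total));
    for (auto& t : tasks) {
        std::vector<int> piece = t.get();
        result.insert(result.end(), piece.begin(), piece.end());
    }
    return result;
}
```

Timing `generate_chunked(2000, 10)` (200 tasks) against `generate_chunked(2000, 2000)` (one task) with `std::chrono::steady_clock` should make the per-task overhead visible, though the exact numbers will depend on the machine and the standard library.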
You might see a performance gain if each thread were instead responsible for, say, 10,000 integers, but you'll probably run into a different issue instead:
All your code is bottlenecked by an inherently serial process
Both versions of the code copy the generated integers into a host vector. It would be one thing if the act of generating those integers was itself a time consuming process, but in your case, it's likely just a matter of a small, fast bit of assembly generating each integer.
So the act of copying each integer into the final vector is probably not inherently faster than generating each integer, meaning a sizable chunk of the "work" being done is completely serial, defeating the whole purpose of multithreading your code.
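One common way to put a number on this is Amdahl's law, which I'm bringing in here as a rule of thumb rather than something measured on your code. The sketch below computes the best-case speedup when only part of the work can run in parallel; the function name and the 50% figure in the usage note are illustrative assumptions.

```cpp
#include <cassert>

// Amdahl's law: upper bound on speedup when only `parallel_fraction`
// of the total work can be spread across `workers` threads.
double amdahl_speedup(double parallel_fraction, unsigned workers) {
    assert(workers > 0);
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / workers);
}
```

If, hypothetically, half of the runtime were the serial copy, `amdahl_speedup(0.5, 2)` gives about 1.33, and no number of threads could push the speedup to 2. That bound is before any thread startup cost is subtracted.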
Fixing the code
Compilers are very good at their jobs, so in trying to revise your code, I was only barely able to get multithreaded code that was faster than the serial code. Multiple executions had varying results, so my general assessment is that this kind of code is bad at being multithreaded.
But here's what I came up with:
#include <algorithm>
#include <chrono>
#include <future>
#include <iomanip>
#include <iostream>
#include <thread> // needed for std::thread::hardware_concurrency
#include <vector>

//#1: Constants
constexpr int BLOCK_SIZE = 500000;
constexpr int NUM_OF_BLOCKS = 20;

// Generates the integers [i, i + BLOCK_SIZE).
std::vector<int> Generate(int i) {
    std::vector<int> v;
    for (int j = i; j < i + BLOCK_SIZE; ++j) {
        v.push_back(j);
    }
    return v;
}

void asynchronous_attempt() {
    std::vector<std::future<void>> futures;
    //#2: Preallocated Vector
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i += BLOCK_SIZE) {
        futures.push_back(std::async(
            [it](int i) {
                auto vec = Generate(i);
                //#3 Copying done multithreaded
                // Each task writes to its own disjoint range, so no locking is needed.
                std::copy(vec.begin(), vec.end(), it + i);
            }, i));
    }
    // Wait for every task to finish before res goes out of scope.
    for (auto &&f : futures) {
        f.get();
    }
}

void serial_attempt() {
    //#4 Changes here to show fair comparison
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i += BLOCK_SIZE) {
        auto vec = Generate(i);
        it = std::copy(vec.begin(), vec.end(), it);
    }
}

int main() {
    using clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;
    std::cout << "Theoretical # of Threads: " << std::thread::hardware_concurrency() << std::endl;
    auto begin = clock::now();
    asynchronous_attempt();
    auto end = clock::now();
    // duration_cast makes sure the printed value really is in nanoseconds.
    std::cout << "Duration of Multithreaded Attempt: " << std::setw(10) << duration_cast<nanoseconds>(end - begin).count() << "ns" << std::endl;
    begin = clock::now();
    serial_attempt();
    end = clock::now();
    std::cout << "Duration of Serial Attempt: " << std::setw(10) << duration_cast<nanoseconds>(end - begin).count() << "ns" << std::endl;
}
This resulted in the following output:
Theoretical # of Threads: 2
Duration of Multithreaded Attempt: 361149213ns
Duration of Serial Attempt: 364785676ns
Given that this was run on an online compiler, I'm willing to bet the multithreaded code might win out on a dedicated machine, but I think this at least demonstrates that the two methods are now on par.
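Because single runs vary this much, it may help to repeat each measurement and compare a middle value instead of one sample. Here is a minimal sketch of that idea; `median_ns` is a hypothetical helper of my own, not part of the program above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: run `work` `runs` times and return the median
// duration in nanoseconds (the upper median when `runs` is even).
template <typename F>
std::int64_t median_ns(F&& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int r = 0; r < runs; ++r) {
        auto begin = clock::now();
        work();
        auto end = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());
    }
    // nth_element places the median at `mid` without fully sorting.
    auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

Calling `median_ns(asynchronous_attempt, 10)` and `median_ns(serial_attempt, 10)` would then give two figures that are less sensitive to a single noisy run, although on a shared online compiler the spread may still be large.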
Below are the changes I made, keyed to the IDs in the code comments:
- #1: We've dramatically increased the number of integers being generated, to force the threads to do actual meaningful work instead of getting bogged down in OS-level housekeeping.
- #2: The result vector has its size preallocated. No more frequent resizing.
- #3: Now that the space has been preallocated, we can multithread the copying instead of doing it in serial later.
- #4: We have to change the serial code so it also preallocates and copies, so that it's a fair comparison.
Now, we've ensured that all the code is indeed running in parallel, and while it's not amounting to a substantial improvement over the serial code, it's at least no longer exhibiting the degenerate performance losses we were seeing before.
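One further variation you could try, which I have not benchmarked, is to skip the temporary vector entirely and let each task fill its own slice of the preallocated result. This is only a sketch: `generate_in_place` is a hypothetical function, it uses `std::iota` in place of the `Generate` loop, and the explicit `std::launch::async` policy is my assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <limits>
#include <numeric>
#include <vector>

// Sketch: each task writes directly into its own disjoint slice of `res`,
// so there is no per-block temporary vector and no copy step at all.
std::vector<int> generate_in_place(std::size_t blocks, std::size_t block_size) {
    // The values are stored as int, so the total count must fit in an int.
    assert(blocks * block_size <=
           static_cast<std::size_t>(std::numeric_limits<int>::max()));
    std::vector<int> res(blocks * block_size);
    std::vector<std::future<void>> futures;
    futures.reserve(blocks);
    for (std::size_t b = 0; b < blocks; ++b) {
        int* first = res.data() + b * block_size;       // start of this slice
        int start = static_cast<int>(b * block_size);   // first value to write
        futures.push_back(std::async(std::launch::async,
            [first, start, block_size] {
                std::iota(first, first + block_size, start);
            }));
    }
    // res is never resized while tasks run, so the raw pointers stay valid.
    for (auto& f : futures) {
        f.get();
    }
    return res;
}
```

With the constants from the program above this would be called as `generate_in_place(NUM_OF_BLOCKS, BLOCK_SIZE)`. Whether it actually beats the serial version will still depend on how cheap generating each integer is compared with the cost of starting the tasks.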