Multi-threading slows down my code, even though I've paid attention to these posts:
Multi-threaded GEMM slower than single threaded one?
Why is this OpenMP program slower than single-thread?
I think all the precautions were taken care of:
My CPU is 4 cores + hyperthreading (8 effectively) and I don't run more than 4 threads
Number of vector entries each thread works on seems large enough (2 million per thread). Therefore any false-sharing (cache line problem) should be negligible because most data doesn't overlap with data of other threads.
Entries are consecutive in memory, so the possibility of a cache miss is very small.
Using a tmp variable for consecutive operations, instead of assigning values directly into the array.
Building in release mode, Visual Studio.
There are no critical sections between threads (they don't use mutexes and don't share data)
The measured time includes thread creation. Surely, launching 4 threads can't be that expensive?
1 thread: around 140 milliseconds
4 threads: around 155 milliseconds
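To see how much of the measured time could be thread start-up, the launch cost can be timed in isolation. This is a minimal sketch, assuming a hypothetical helper `measureLaunchCostMicros` (not part of the code in this post) that creates and joins threads with empty bodies. The number it returns depends on the machine and OS, so it is only a rough baseline to compare against the millisecond figures above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (an assumption, not from the post): returns the
// wall-clock time, in microseconds, spent creating and joining `count`
// threads that do no work at all.
double measureLaunchCostMicros(std::size_t count) {
    assert(count > 0 && "need at least one thread");
    using clock = std::chrono::steady_clock; // monotonic, suited to intervals

    const auto t1 = clock::now();
    std::vector<std::thread> threads;
    threads.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        threads.emplace_back([] {}); // empty body: only the launch cost remains
    }
    for (auto& t : threads) {
        t.join();
    }
    const auto t2 = clock::now();

    // A floating-point duration keeps sub-millisecond resolution.
    return std::chrono::duration<double, std::micro>(t2 - t1).count();
}
```

Calling `measureLaunchCostMicros(4)` several times and averaging gives a figure that can be subtracted from the 4-thread timing.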
Main:
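#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

struct MyStruct {
    double val = 0;
};

size_t numEntries = 100e4; // 1,000,000 entries
size_t numThreads = 4;
std::vector<MyStruct> arr;

void launchThreads(float &avgTime); // defined further down

int main(){
    arr.reserve(numEntries);
    for(size_t i=0; i<numEntries; ++i){
        MyStruct m{ static_cast<double>(i) }; // explicit cast avoids a narrowing conversion
        arr.push_back(m);
    }
    //run several times
    float avgTime=0;
    for(size_t n=0; n<100; ++n){
        launchThreads(avgTime);
        //space out to make avgTime more even:
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    avgTime /= 100;
    std::cout << "finished in " << avgTime << " milliseconds\n";
    system("pause");
}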
Launching and running the threads:
//run by each thread
void threadWork(size_t threadId){
    size_t numPerThread = (numEntries+numThreads -1) / numThreads;
    size_t start_ix = threadId * numPerThread;
    size_t endIx;
    if (threadId == numThreads - 1) {
        endIx = numEntries-1;//we are the last thread
    }
    else {
        endIx = start_ix + numPerThread;
    }
    for(size_t i=5; i<endIx-5; ++i){
        double tmp = arr[i].val;
        tmp += arr[i-1].val;
        tmp += arr[i-3].val;
        tmp += arr[i-4].val;
        tmp += arr[i-5].val;
        tmp += arr[i-2].val;
        tmp += arr[i+1].val;
        tmp += arr[i+3].val;
        tmp += arr[i+4].val;
        tmp += arr[i+5].val;
        tmp += arr[i+2].val;
        if(tmp > 0){ tmp *= 0.5f;}
        else{ tmp *= 0.3f; }
        arr[i].val = tmp;
    }
}
//measures time
void launchThreads(float &avgTime){
    using namespace std::chrono;
    typedef std::chrono::milliseconds ms;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    std::vector<std::thread> threads;
    for (size_t i = 0; i < numThreads; ++i) { // size_t matches numThreads and threadWork's parameter
        threads.emplace_back(threadWork, i);
    }
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i].join();
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    ms timespan = duration_cast<ms>(t2 - t1);
    avgTime += timespan.count();
}
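One thing worth checking in the code above: the loop in `threadWork` starts at `i=5` for every thread and never uses `start_ix`, so the ranges the threads process overlap instead of being disjoint. Below is a minimal sketch of a disjoint split, assuming a hypothetical helper `threadRange` and a `halo` parameter (both names are illustrative, not from the post) that keeps the 5-element neighbourhood inside the array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (an assumption, not from the post): returns the
// half-open index range [first, last) owned by `threadId`, clipped so that
// reads of i - halo and i + halo stay inside [0, numEntries).
std::pair<std::size_t, std::size_t> threadRange(std::size_t threadId,
                                                std::size_t numThreads,
                                                std::size_t numEntries,
                                                std::size_t halo) {
    assert(numThreads > 0 && threadId < numThreads);

    // Same ceiling division as in the post.
    const std::size_t perThread = (numEntries + numThreads - 1) / numThreads;

    // Unclipped chunk for this thread.
    std::size_t first = std::min(threadId * perThread, numEntries);
    std::size_t last = std::min(first + perThread, numEntries);

    // Clip against the array borders so the neighbourhood stays in bounds.
    const std::size_t lowest = std::min(halo, numEntries);
    const std::size_t highest = numEntries > halo ? numEntries - halo : 0;
    first = std::max(first, lowest);
    last = std::min(last, highest);

    if (first > last) {
        first = last; // empty range when the array is too small
    }
    return {first, last};
}
```

With 1,000,000 entries, 4 threads and `halo = 5`, thread 0 gets [5, 250000) and thread 3 gets [750000, 999995). The loop would then read `for (size_t i = range.first; i < range.second; ++i)`. This sketch only addresses the index ranges: reads of neighbours across a range boundary still touch elements that another thread writes.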