This is my first post here, although I do visit the site regularly and find a lot of valuable information here.
I have an embarrassingly parallel algorithm that I expected would show great performance improvements with multi-threading.
This is my first experience with multi-threading, after quite a bit of reading and review.
I'm working in C++ with VS 2012 and my Windows 7 laptop has an i7 processor with four cores and plenty of memory.
The fundamental work breaks down to this pseudo-code:
for (int i = 0; i < iMax; i++) {
    for (int j = 0; j < jMax; j++) {
        T[j] += E[j][i] * SF;
    }
}
T, E and SF are floats.
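To make the pseudo-code concrete, here is a minimal self-contained sketch of the serial computation. It rests on assumptions the post does not state: each mode is stored as its own contiguous array (so the sketch indexes E[i][j] rather than E[j][i]), one scale factor is shared by all modes, and the name accumulateSerial is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial reference: for each mode i, add E[i][j] * SF to T[j].
// Assumes every mode array has the same length as T and that a single
// scale factor SF applies to all modes.
void accumulateSerial(std::vector<float>& T,
                      const std::vector<std::vector<float>>& E,
                      float SF)
{
    for (std::size_t i = 0; i < E.size(); ++i) {
        assert(E[i].size() == T.size());
        const float* e = E[i].data();
        float* t = T.data();
        // Inner loop walks contiguous memory, the same shape as doWork.
        for (std::size_t j = 0; j < T.size(); ++j) {
            t[j] += e[j] * SF;
        }
    }
}
```

Each pass of the inner loop does the same work as one doWork task, so this is the baseline the threaded version would be compared against.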
The implementation uses a (modified) threadpool from here, and builds and adds a bunch of tasks for the threadpool from this function:
void doWork(float *T, float *E, float SF, int numNodes)
{
    // Critical for performance that these loops vectorize.....
    for (int nodeCounter = 0; nodeCounter < numNodes; nodeCounter++) {
        T[nodeCounter] += E[nodeCounter] * SF;
    }
}
using this construct:
tp.enqueue(std::bind(&doWork, timeStepDisplacements.T1, T1MODE, T1MPF, numNodes));
In my tests, numNodes is 1,000,000 and I call this routine 3 times (with different arrays) for each of 50 outer loops. I have another loop (100 iterations) around the outside of this too, so my test code generates 15,000 of these tasks, with each task carrying out 1,000,000 multiply-adds.
EDIT: Corrected the outer loop count to 100 and the number of tasks from 7,500 to 15,000.
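As a back-of-the-envelope check of the test sizes above, the following sketch totals the tasks, the multiply-adds, and the bytes moved. It assumes 4-byte floats and that, per element, T is read, E is read, and T is written back with nothing served from cache. The names Workload and estimateWorkload are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: sizes up the test described above.
struct Workload {
    std::uint64_t tasks;         // number of doWork tasks enqueued
    std::uint64_t multiplyAdds;  // total T[j] += E[j] * SF operations
    std::uint64_t bytesTouched;  // memory traffic if nothing is served from cache
};

inline Workload estimateWorkload(std::uint64_t outerLoops,
                                 std::uint64_t innerLoops,
                                 std::uint64_t callsPerLoop,
                                 std::uint64_t numNodes)
{
    // Assumption: per element, one float of T is read, one float of E is read,
    // and one float of T is written back.
    const std::uint64_t bytesPerElement = 3 * sizeof(float);

    Workload w;
    w.tasks = outerLoops * innerLoops * callsPerLoop;
    w.multiplyAdds = w.tasks * numNodes;
    w.bytesTouched = w.multiplyAdds * bytesPerElement;
    return w;
}
```

Calling estimateWorkload(100, 50, 3, 1000000) gives 15,000 tasks, 1.5e10 multiply-adds, and about 180 GB of memory traffic under those assumptions. That figure may help in judging whether the loop is limited by arithmetic or by memory access.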
When I set up my threadpool with 8, 16 or more threads, the performance is only marginally better than the serial code: say 8.8 seconds vs. 9.3 seconds.
So my question is why is the performance improvement so small?
NOTE: If I use a different task routine (work_proc below), the same threadpool setup shows great performance gains.
void work_proc()
{
    int i = 555;
    std::random_device rd;
    std::mt19937 rng(rd());
    // build a vector of random numbers
    std::vector<int> data;
    data.reserve(100000);
    std::generate_n(std::back_inserter(data), data.capacity(), [&](){ return rng(); });
    std::sort(data.begin(), data.end());
}
I have no problem posting the entire code, but I figured I'd start with just these key pieces.
Thanks in advance for any insight that can be offered.