
This is my first post here, although I do visit the site regularly and find a lot of valuable information here.

I have an embarrassingly parallel algorithm that I expected would show great performance improvements with multi-threading.

This is my first experience with multi-threading, after quite a bit of reading and review.

I'm working in C++ with VS 2012 and my Windows 7 laptop has an i7 processor with four cores and plenty of memory.

The fundamental work breaks down to this pseudo-code:

for (int i = 0; i<iMax; i++){
    for (int j = 0; j<jMax; j++){
        T[j] += E[j][i] * SF;
    }
}

T is an array of floats, E is a two-dimensional array of floats, and SF is a scalar float.

The implementation uses a (modified) thread pool from here, and builds and enqueues a batch of tasks for it with this function:

void doWork(float *T, float *E, float SF, int numNodes)
{
    // Critical for performance that these loops vectorize.....
    for (int nodeCounter = 0; nodeCounter < numNodes; nodeCounter++){
        T[nodeCounter] += E[nodeCounter] * SF;
    }
}

using this construct,

tp.enqueue(std::bind(&doWork, timeStepDisplacements.T1, T1MODE, T1MPF, numNodes));

In my tests, numNodes is 1,000,000 and I call this routine 3 times (with different arrays) for each of 50 outer loops. There is another loop (100) around the outside of that too, so my test code generates 15,000 of these tasks, each carrying out 1,000,000 multiply-adds.

EDIT: Corrected the outer loop count to 100 and the number of tasks from 7,500 to 15,000.

When I set up my thread pool with 8, 16 or more threads, the performance is only marginally better than the serial code: say 8.8 seconds vs. 9.3.

So my question is: why is the performance improvement so small?

NOTE: If I use a different task routine (work_proc, below), the same thread-pool setup shows great performance gains.

void work_proc()
{
    std::random_device rd;
    std::mt19937 rng(rd());

    // build a vector of random numbers
    std::vector<int> data;
    data.reserve(100000);
    std::generate_n(std::back_inserter(data), data.capacity(), [&](){ return rng(); });
    std::sort(data.begin(), data.end());
}

I have no problem posting the entire code - but I figured I'd start with just these key pieces.

Thanks in advance for any insight that can be offered.

max375
  • Incrementing `j` in the inner loop may mean lots of cache misses. Maybe try refactoring the loop to be more cache friendly. – Jonathan Potter Jan 07 '16 at 01:22
  • Is your OS running threads on separate cores or the same core? – Thomas Matthews Jan 07 '16 at 01:22
  • Does each core have a separate floating point processor or hardware assist? – Thomas Matthews Jan 07 '16 at 01:23
  • Jonathan - The actual implementation uses single-dimensional arrays and is designed to make sure I get good vectorization of the loop. – max375 Jan 07 '16 at 01:26
  • Please try to provide a complete example that we can compile and run. I find it very hard to reason about this with that little context. – 5gon12eder Jan 07 '16 at 01:26
  • Thomas - I'm not sure how to answer your question. I'm running C++ using VS 2012 on Windows 7 with a four-core i7 processor, so I'm assuming the OS is distributing the threads across the different cores. Looking at the performance monitor, I can see the CPU utilization of all threads is maxed out. – max375 Jan 07 '16 at 01:30
  • Note that the time your CPU spends waiting for memory counts as CPU utilization. – Matt Timmermans Jan 07 '16 at 01:32
  • Matt - Understood on memory waits counting as CPU utilization. I'm trying to use the VS performance tools to diagnose if this is the problem - but I haven't figured out how to interpret the output yet...... – max375 Jan 07 '16 at 01:36
  • 5gon12eder - what's the best way to provide example code? I can't see how to attach a zip file.... – max375 Jan 07 '16 at 01:43

1 Answer


You may have glossed over some important bits, but if your pseudo-code is accurate, then it looks like the bottleneck is memory access.

A single core can add numbers fast enough to keep your DRAM pretty much fully utilized, so there's not much performance to be gained by splitting that work up.

EDIT: You can calculate your DRAM transfer rate if you know your DRAM type and I/O clock rate. Is that about how fast it goes?

For example: 15000*1000000 floats in 9.3 seconds is 6.4 GB/s for the reads. If you're writing the same amount, then that's 12.8 GB/s, which is the maximum rate for the DDR3-1600 that you say you're using in comments...

So that is certainly your problem.

Note that you should not really need to write the same amount, so if you restructure the algorithm to be more cache friendly, you may make it almost twice as fast on your box.

If you have each worker do 4 Es, like:

T[nodeCounter] += (E1[nodeCounter] + E2[nodeCounter] + E3[nodeCounter] + E4[nodeCounter]) * SF;

then that will reduce your T bandwidth significantly, and get you pretty close to the maximum speed.

Matt Timmermans
  • Matt - Trying to figure out my hardware specs now... The CPU is an i7-4710MQ at 2.5 GHz. It's not obvious where I find the other info, but I'm looking... – max375 Jan 07 '16 at 01:52
  • Memory is DDR3L @ 1600 MHz – max375 Jan 07 '16 at 01:58
  • Can you test using [nodecounter&255] instead of [nodecounter]? That will remove any memory bandwidth problem (although it may introduce contention, so try it with 1 and then 2 threads) – Matt Timmermans Jan 07 '16 at 02:08
  • Matt - I'll think about restructuring for cache friendliness, although I have other competing considerations, unfortunately. But I think I could restructure. The problem is I can't see a way of getting better caching for both E and T. – max375 Jan 07 '16 at 02:14
  • It looks like there's not much you can do about E, since it's huge and you need to read each part only once. But you can make sure you don't overwrite the T[s] a whole bunch of times... or at least make sure you finish all the writes to one before it gets kicked out of the cache. – Matt Timmermans Jan 07 '16 at 02:23
  • Edited the answer to reflect your new numbers -- everything matches up now – Matt Timmermans Jan 07 '16 at 02:27
  • Matt - I rechecked and my outer loop in the test is 100, not 50. So the observed transfer rate is also doubled. So this looks more and more like a memory access bottleneck..... which sucks for me :( – max375 Jan 07 '16 at 02:27
  • I don't get the "[nodecounter&255]" request? What is this expected to do? – max375 Jan 07 '16 at 02:30
  • it'll make all the memory that the worker accesses fit into the L1 or L2 cache (only 256 floats), so it doesn't have to go to RAM nearly as much – Matt Timmermans Jan 07 '16 at 02:32
  • Matt - I modified the worker code to process four E's per T, and the elapsed time dropped from ~9.3 secs to ~4.7 secs! Then I changed "nodeCounter" to "nodeCounter&255" and the elapsed time dropped to ~1.7 seconds! So it's clear to me that the issue, without doubt, was memory bandwidth, and the only improvements that would work are related to improving cache usage. Thanks for the help - this level of performance makes an "on the fly" calculation strategy feasible for my project. – max375 Jan 08 '16 at 01:00
  • One last question - can you point me to a reference for the technique that forces the loop data to cache? – max375 Jan 08 '16 at 01:08
  • Note that the &255 thing gives you the wrong answer! It just demonstrates that the bottleneck really is memory bandwidth. Sorry, I don't have a reference for this stuff -- just experience. – Matt Timmermans Jan 08 '16 at 01:52