
Suppose I want to do a parallel sum of an array; here is the pseudocode:

SplitIntoNParts();
int *result = new int[N]; // this array is shared among the N worker threads
// worker i stores its partial sum in result[i];
// since worker thread i only accesses result[i], there is no race condition
for (int i = 0; i < N; i++) {
    // create a worker thread that sums part i and stores the result in result[i]
}
// join the threads
// return the sum of the result array

I read the following in a lecture on parallel programming (page 25):

If array elements happen to share a cache line, this leads to false sharing. – Non-shared data in the same cache line so each update invalidates the cache line … in essence “sloshing independent data” back and forth between threads.

It says sharing the result array is false sharing, but I didn't quite get it.

What does "each update invalidates the cache line" mean?

Where does the false sharing come from?

Does it mean that updates to the cached array invalidate the cache line?

Or does it mean that when one core updates its copy of the cache line, all other cores' copies of that line are invalidated? But if that is the case, then even using a single `result` variable (with a lock to synchronize the additions) instead of an array would not avoid the problem: one core modifying `result` would still invalidate all the other cores' copies.

    Read about MESI to understand how cache lines are shared. In short, only one core can have a line in the Modified state. Since `result` is an array, `result[0]` up to `result[15]` may share the same line. So when a thread on core 0 writes to `result[0]`, it acquires ownership of the line holding `result[0..15]`. A thread that needs `result[1]` then has to reacquire ownership of that same line. So the same line ping-pongs between the cores every time an item is read or written. – Margaret Bloom Oct 18 '22 at 08:33
  • So the advantage of using a single `result` instead of an array is that the ping-pong process is faster since transport load is lighter? @MargaretBloom – Name Null Oct 18 '22 at 08:36
    If you want efficient, multithreaded computation, you have to make sure that the different threads do not **frequently** work on the same cache line. Step 1: divide the task so each thread can work on its own, local cacheline(s). Step 2: Let them work. Step 3: Put the partial results back together. – SoulKa Oct 18 '22 at 08:40
  • For your specific task this would mean that **each worker thread** has **its own** local result. After joining the N threads, the main thread takes the partial results and returns the sum of the N partial sums. – SoulKa Oct 18 '22 at 08:42
    A typical way to solve this problem is to add some padding so that threads do not use the same cache line for their own results. That is still not efficient on some platforms: on NUMA systems, pages can be stored on specific nodes, so threads on other nodes will have slow access to the target page. The best solution is to use thread-local storage and a clever reduction tree. Runtimes like the one of OpenMP do that for you. Consider using them rather than reinventing the wheel if you are not a specialist of the field ;) . – Jérôme Richard Oct 18 '22 at 09:01

0 Answers