So essentially you are calculating a value for each lane of a threadgroup, then you want to add them together?
(Going forward I will mostly use the term lane, rather than thread, because IMHO that's less confusing. A lane is like a lane of a SIMD execution unit [eg NEON or AVX].)
The first issue is that you seem to be confused about the difference between a
- SIMD group (32 lanes of execution that execute in lockstep, and because of this can exchange values with each other without going through memory)
- threadgroup (a collection of SIMD groups that share the local storage of a single GPU core)
Let's say a threadgroup contains 128 lanes.
It could in principle execute as 4 SIMDs at the same time, in lockstep (since a GPU core is built from SIMDs).
It could execute as 4 SIMDs somewhat at the same time, but offset relative to each other by a few cycles. (Since there is no rule that what happens in one quarter of a GPU core is in any way synched with the other three quarters).
It could execute sequentially, 32 lanes at a time, always on the same SIMD (ie quarter core) while the other quarter cores do their own thing.
And so on.
What it can't do, however, is execute across two different GPU cores with different local storage.
So you want to do two things.
First, within a SIMD (ie 32 lanes, ie a quarter of a GPU core) you want to add the values of each lane, ie you want to perform a reduction.
The way to do this on any modern Apple Silicon GPU is to use a reduction like simd_sum(). (The exact specs for this are in the MSL PDF.)
That will give you the sum (ie the volume) across a particular SIMD group (ie set of 32 lanes). And it will do it in log2(32) = 5 steps rather than 32 steps, and without memory traffic back and forth.
There are a variety of reduction primitives, or other instructions for lane-to-lane operations without going through memory.
(Even faster are lane-to-lane operations within a quad, ie a group of four adjacent lanes; so a SIMD is made of eight quads. If you can use the quad-wide communication instructions rather than the SIMD-wide communication instructions, better performance!)
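In MSL the per-SIMD reduction is essentially a one-liner. Here's a minimal sketch (buffer names and the per-lane value are placeholders I've invented; the only part that matters is the simd_sum() call):

```
#include <metal_stdlib>
using namespace metal;

// Minimal sketch: each lane has one value; simd_sum() folds the 32 lane
// values into one total entirely in registers, no memory traffic.
kernel void per_simd_sum(device const float* values      [[buffer(0)]],
                         device float*       simdTotals  [[buffer(1)]],
                         uint gid        [[thread_position_in_grid]],
                         uint lane       [[thread_index_in_simdgroup]],
                         uint simdInTg   [[simdgroup_index_in_threadgroup]],
                         uint simdsPerTg [[simdgroups_per_threadgroup]],
                         uint tgIndex    [[threadgroup_position_in_grid]])
{
    float myValue   = values[gid];        // whatever each lane computes
    float simdTotal = simd_sum(myValue);  // sum across this SIMD group's 32 lanes

    // Every lane of the SIMD group now holds the same simdTotal;
    // have lane 0 write out one result per SIMD group.
    if (lane == 0) {
        simdTotals[tgIndex * simdsPerTg + simdInTg] = simdTotal;
    }
}
```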
At this point we have four sums we wish to add together.
But those sums could be spread over space (quarters of the GPU core) and/or time. This means you HAVE to be careful about how you add them.
Before we discuss that, think about where your code went wrong, given the above explanations. The first step, writing the result of each lane to a distinct memory address, is fine.
But then in the next step you have what will (depending on exactly what the compiler does, but this is the best case) eventually turn into four separate SIMD operations that each load from the shared variable sum, add 32 values to it, and then store to the shared variable sum. In what order will those loads and stores occur? Who knows. Think about it. If, for example, all four SIMD groups are scheduled together on the four quadrants of the GPU core, then each will load the initial value of zero, each will add 32 elements, then each will try to store. One of them will "win" in the sense that it will be the last to store, but it will only have about a quarter of the sum! Other times you may (by random luck) get two of the SIMD groups running back to back, so the first correctly stores its quarter, the second loads it and adds, and you actually get about half the sum.
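To make the failure mode concrete, here is a sketch of roughly that pattern (invented names, not your actual code, just the shape of the bug, assuming a 128-lane threadgroup):

```
#include <metal_stdlib>
using namespace metal;

// Reconstruction of the problematic pattern: step 1 (one slot per lane) is
// fine, but step 2 is a plain load/add/store on a shared, non-atomic
// variable, so the four SIMD groups race exactly as described above.
kernel void racy_sum(device const float* values [[buffer(0)]],
                     device float*       result [[buffer(1)]],
                     uint gid      [[thread_position_in_grid]],
                     uint tid      [[thread_index_in_threadgroup]],
                     uint lane     [[thread_index_in_simdgroup]],
                     uint simdInTg [[simdgroup_index_in_threadgroup]])
{
    threadgroup float partials[128];   // step 1: distinct address per lane -- fine
    threadgroup float sum;

    partials[tid] = values[gid];
    if (tid == 0) sum = 0.0f;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // step 2: the broken part. Each SIMD group loads sum, adds its 32
    // values, and stores back. Whichever store lands last "wins".
    if (lane == 0) {
        float s = sum;                               // load shared sum
        for (uint i = 0; i < 32; ++i) {
            s += partials[simdInTg * 32u + i];       // add this SIMD group's values
        }
        sum = s;                                     // store back -- lost updates
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (tid == 0) result[0] = sum;   // usually ends up with only part of the total
}
```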
The important points are that
- whenever a threadgroup is larger than a SIMD, you have to be VERY careful about any shared mutable data, because there are no guarantees as to the order in which the SIMDs (ie groups of 32 lanes) will read/write that shared mutable data.
- barriers could help with this, but they are really the wrong tool. You use a barrier for two reasons:
  - to ensure that everyone has reached a certain point in the computation before the next step (synching in time, no concern with memory) OR
  - to ensure that values that will be read by the next step have been written out to memory so that they can be seen by other lanes and other quadrants of the GPU core. (Just executing a store instruction doesn't mean the value is now in the core-local storage! It just means the value has started down a long path that, multiple cycles later, ends with that value stored in memory and visible to other lanes of the same GPU core.)
Given the above, you COULD hack together a loop that would add the four sub-sums correctly to get the final total sum, but there's a much better way, namely to use atomics, as in atomic_fetch_add_explicit.
This will do all the steps necessary in terms of synchronization, so you just have to start by initializing volume to zero and then, within each SIMD, do an atomic_fetch_add_explicit on that variable to add in the sub-volume from that SIMD.
(This basic approach will wait four times to add in the four sub-values. That's fine when only four values need to be added.
If you care about performance and have many more threads, you can start to do things like keep an array of sub-sums, so that you atomically add some sums to one variable and some to another, then do a second pass that adds up all those second-level sub-sums. The final version of this, of course, would be a tree of sub-sums, requiring only log(N) rather than N synchronization waits.)
So at this point we have solved
- generating a sum within a SIMD (use simd_sum)
- generating a sum within a threadgroup (use atomic_fetch_add_explicit)
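Putting those two pieces together, a threadgroup-wide sum looks roughly like this (again a sketch with invented names; I've used uint so the atomic add works on any recent Metal version, and note that the barriers are doing exactly the two jobs described earlier):

```
#include <metal_stdlib>
using namespace metal;

// Sketch of the two-level pattern: simd_sum() within each SIMD group,
// then an atomic add into threadgroup (core-local) memory.
kernel void threadgroup_sum(device const uint* values   [[buffer(0)]],
                            device uint*       tgTotals [[buffer(1)]],
                            uint gid     [[thread_position_in_grid]],
                            uint tid     [[thread_index_in_threadgroup]],
                            uint lane    [[thread_index_in_simdgroup]],
                            uint tgIndex [[threadgroup_position_in_grid]])
{
    threadgroup atomic_uint tgSum;

    // One lane zeroes the shared accumulator; the barrier makes sure
    // everyone sees that zero before anyone adds to it.
    if (tid == 0) {
        atomic_store_explicit(&tgSum, 0u, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Level 1: reduce within each SIMD group, entirely in registers.
    uint simdTotal = simd_sum(values[gid]);

    // Level 2: one lane per SIMD group adds its sub-sum atomically, so the
    // order in which the SIMD groups get here no longer matters.
    if (lane == 0) {
        atomic_fetch_add_explicit(&tgSum, simdTotal, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // One lane per threadgroup writes out the finished threadgroup total.
    if (tid == 0) {
        tgTotals[tgIndex] = atomic_load_explicit(&tgSum, memory_order_relaxed);
    }
}
```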
Suppose that you use a grid (ie a set of threads larger than a threadgroup, spread both in time and across multiple GPU cores). Each core can calculate its local sum using the above two techniques, but we then have to sum the per-core sums across cores (and of course we have to be careful about synchronization). An atomic in threadgroup memory can't do that, because threadgroup memory is per-core (it uses the tile/core-local storage). What we need is storage that's shared across all the cores, which means device memory, effectively the GPU L2. The way we do that is to declare a device-scope atomic variable (eg atomic_uint or atomic<T>) and atomically add our per-core sum to that variable.
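In code it's the same pattern with only the last step swapped: instead of writing the threadgroup total to its own buffer slot, you add it into a single device-scope atomic (again a sketch; grandTotal here is a one-element buffer that the host zeroes before the dispatch):

```
#include <metal_stdlib>
using namespace metal;

// Same structure as the previous sketch; only the final step changes.
kernel void grid_sum(device const uint*  values     [[buffer(0)]],
                     device atomic_uint* grandTotal [[buffer(1)]],
                     uint gid  [[thread_position_in_grid]],
                     uint tid  [[thread_index_in_threadgroup]],
                     uint lane [[thread_index_in_simdgroup]])
{
    threadgroup atomic_uint tgSum;

    if (tid == 0) {
        atomic_store_explicit(&tgSum, 0u, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tier 1: within each SIMD group, in registers.
    uint simdTotal = simd_sum(values[gid]);

    // Tier 2: within the threadgroup, via core-local (threadgroup) memory.
    if (lane == 0) {
        atomic_fetch_add_explicit(&tgSum, simdTotal, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tier 3: across the whole grid, via a device-scope atomic (GPU L2).
    if (tid == 0) {
        uint tgTotal = atomic_load_explicit(&tgSum, memory_order_relaxed);
        atomic_fetch_add_explicit(grandTotal, tgTotal, memory_order_relaxed);
    }
}
```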
I know this is all overwhelming (and not helped by the fact that every GPU and GPU environment uses its own language). But I hope that this explanation of what's going on in the hardware helps.
The bottom line is that GPUs are made of SIMDs [groups of 32 execution lanes], SIMDs are organized into cores (multiple SIMDs sharing local storage), and an SoC has multiple GPU cores sharing a GPU L2. You COULD just use the GPU L2 and an atomic to do everything, to do every sum. But that would be overwhelmingly slow! Metal offers these tiers of different functionality (within-SIMD, within-core, and cross-core) so that you can do most of the work as fast, as parallel, and as low-energy as possible, only moving to a higher tier when you really need to, and doing as little work as possible at that higher tier.
Looking at the Alloy code (suggested by someone in this thread), it looks like they are trying to be super backwards compatible, and so are avoiding most of the conveniences introduced in Metal after 1.0. They seem to do a lot of things that are far from best practice in current Metal (which doesn't mean it's bad/dumb code, just that, as I said, they're trying to be very backward compatible).
I assume people reading this don't care about being compatible with anything before about an M1 and Metal 3.0, so I've tried to describe what to do for future-facing code.
You may want to look at
https://developer.apple.com/videos/play/tech-talks/10858
for an overview of what I've said. The whole talk is interesting, but starting from 17 minutes in you get an overview of grids, threadgroups, and SIMD groups, and a discussion of things like the new simd reduction instructions.