So essentially you are calculating a value for each lane of a threadgroup, then you want to add them together?
(Going forward I will mostly use the term lane, rather than thread, because IMHO that's less confusing. A lane is like a lane of a SIMD execution unit [eg NEON or AVX].)
The first issue is that you seem to be confused about the difference between a
- SIMD group (32 lanes of execution that execute in lockstep, and because of this can exchange values with each other without going through memory)
- threadgroup (a collection of SIMD groups that share the local storage of a single GPU core)
Let's say a threadgroup contains 128 lanes.
It could in principle execute as 4 SIMDs at the same time, in lockstep (since a GPU core is built from SIMDs).
It could execute as 4 SIMDs somewhat at the same time, but offset relative to each other by a few cycles. (Since there is no rule that what happens in one quarter of a GPU core is in any way synched with the other three quarters).
It could execute sequentially, 32 lanes at a time, always on the same SIMD (ie quarter core) while the other quarter cores do their own thing.
And so on.
What it can't do, however, is execute across two different GPU cores with different local storage.
So you want to do two things.
First, within a SIMD (ie 32 lanes, ie a quarter of a GPU core) you want to add the values of each lane, ie you want to perform a reduction.
The way to do this on any modern Apple Silicon GPU is to use a reduction like simd_sum(). (The exact specs for this are in the MSL PDF.)
That will give you the sum (ie the volume) across a particular SIMD group (ie set of 32 lanes). And it will do it in log2(32) = 5 steps rather than 32 steps, and without memory traffic back and forth.
There are a variety of reduction primitives, or other instructions for lane-to-lane operations without going through memory.
(Even faster are lane-to-lane operations within a quad, ie a group of four adjacent lanes; so a SIMD is made of eight quads. If you can use the quad-wide communication instructions rather than the SIMD-wide communication instructions, better performance!)
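In MSL the per-SIMD reduction is essentially a one-liner. Here's a minimal sketch (buffer names and the per-lane value are placeholders I've invented; the only part that matters is the simd_sum() call):

```
#include <metal_stdlib>
using namespace metal;

// Minimal sketch: each lane has one value; simd_sum() folds the 32 lane
// values into one total entirely in registers, no memory traffic.
kernel void per_simd_sum(device const float* values      [[buffer(0)]],
                         device float*       simdTotals  [[buffer(1)]],
                         uint gid        [[thread_position_in_grid]],
                         uint lane       [[thread_index_in_simdgroup]],
                         uint simdInTg   [[simdgroup_index_in_threadgroup]],
                         uint simdsPerTg [[simdgroups_per_threadgroup]],
                         uint tgIndex    [[threadgroup_position_in_grid]])
{
    float myValue   = values[gid];        // whatever each lane computes
    float simdTotal = simd_sum(myValue);  // sum across this SIMD group's 32 lanes

    // Every lane of the SIMD group now holds the same simdTotal;
    // have lane 0 write out one result per SIMD group.
    if (lane == 0) {
        simdTotals[tgIndex * simdsPerTg + simdInTg] = simdTotal;
    }
}
```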
At this point we have four sums we wish to add together.
But those sums could be spread over space (quarters of the GPU core) and/or time. This means you HAVE to be careful about how you add them.
Before we discuss that, think about where your code went wrong, given the above explanations. The first step, writing the result of each lane to a distinct memory address, is fine.
But then in the next step you have what will (depending on exactly what the compiler does, but this is the best case) eventually turn into four separate SIMD operations that each load from the shared variable sum, add 32 values to it, and then store to the shared variable sum. In what order will those loads and stores occur? Who knows. Think about it. If, for example, all four SIMD groups are scheduled together on the four quadrants of the GPU core, then each will load the initial value of zero, each will add 32 elements, then each will try to store. One of them will "win" in the sense that it will be the last to store, but it will only have about a quarter of the sum! Other times you may (by random luck) get two of the SIMD groups running back to back, so the first correctly stores its quarter, the second loads it and adds, and you actually get about half the sum.
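To make the failure mode concrete, here is a sketch of roughly that pattern (invented names, not your actual code, just the shape of the bug, assuming a 128-lane threadgroup):

```
#include <metal_stdlib>
using namespace metal;

// Reconstruction of the problematic pattern: step 1 (one slot per lane) is
// fine, but step 2 is a plain load/add/store on a shared, non-atomic
// variable, so the four SIMD groups race exactly as described above.
kernel void racy_sum(device const float* values [[buffer(0)]],
                     device float*       result [[buffer(1)]],
                     uint gid      [[thread_position_in_grid]],
                     uint tid      [[thread_index_in_threadgroup]],
                     uint lane     [[thread_index_in_simdgroup]],
                     uint simdInTg [[simdgroup_index_in_threadgroup]])
{
    threadgroup float partials[128];   // step 1: distinct address per lane -- fine
    threadgroup float sum;

    partials[tid] = values[gid];
    if (tid == 0) sum = 0.0f;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // step 2: the broken part. Each SIMD group loads sum, adds its 32
    // values, and stores back. Whichever store lands last "wins".
    if (lane == 0) {
        float s = sum;                               // load shared sum
        for (uint i = 0; i < 32; ++i) {
            s += partials[simdInTg * 32u + i];       // add this SIMD group's values
        }
        sum = s;                                     // store back -- lost updates
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (tid == 0) result[0] = sum;   // usually ends up with only part of the total
}
```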
The important points are that
- whenever a threadgroup is larger than a SIMD, you have to be VERY careful about any shared mutable data, because there are no guarantees as to the order in which the SIMDs (ie groups of 32 lanes) will read/write that shared mutable data.
- barriers could help with this, but they are really the wrong tool. You use a barrier for two reasons:
  - to ensure that everyone has reached a certain point in the computation before the next step (synching in time, no concern with memory) OR
  - to ensure that values that will be read by the next step have been written out to memory so that they can be seen by other lanes and other quadrants of the GPU core. (Just executing a store instruction doesn't mean the value is now in the core-local storage! It just means the value has started down a long path that, multiple cycles later, ends with that value stored in memory and visible to other lanes of the same GPU core.)
Given the above, you COULD hack together a loop that would add the four sub-sums correctly to get the final total sum, but there's a much better way, namely to use atomics, as in atomic_fetch_add_explicit.
This will do all the steps necessary in terms of synchronization, so you just have to start by initializing volume to zero and then, within each SIMD, do an atomic_fetch_add_explicit on that variable to add in the sub-volume from that SIMD.
(This basic approach will wait four times to add in the four sub-values. That's fine when only four values need to be added.
If you care about performance and have many more threads, you can start to do things like keep an array of sub-sums, so that you atomically add some sums to one variable and some to another, then do a second pass that adds up all those second-level sub-sums. The final version of this, of course, would be a tree of sub-sums, requiring only log(N) rather than N synchronization waits.)
So at this point we have solved
- generating a sum within a SIMD (use simd_sum)
- generating a sum within a threadgroup (use atomic_fetch_add_explicit)
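Putting those two pieces together, a threadgroup-wide sum looks roughly like this (again a sketch with invented names; I've used uint so the atomic add works on any recent Metal version, and note that the barriers are doing exactly the two jobs described earlier):

```
#include <metal_stdlib>
using namespace metal;

// Sketch of the two-level pattern: simd_sum() within each SIMD group,
// then an atomic add into threadgroup (core-local) memory.
kernel void threadgroup_sum(device const uint* values   [[buffer(0)]],
                            device uint*       tgTotals [[buffer(1)]],
                            uint gid     [[thread_position_in_grid]],
                            uint tid     [[thread_index_in_threadgroup]],
                            uint lane    [[thread_index_in_simdgroup]],
                            uint tgIndex [[threadgroup_position_in_grid]])
{
    threadgroup atomic_uint tgSum;

    // One lane zeroes the shared accumulator; the barrier makes sure
    // everyone sees that zero before anyone adds to it.
    if (tid == 0) {
        atomic_store_explicit(&tgSum, 0u, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Level 1: reduce within each SIMD group, entirely in registers.
    uint simdTotal = simd_sum(values[gid]);

    // Level 2: one lane per SIMD group adds its sub-sum atomically, so the
    // order in which the SIMD groups get here no longer matters.
    if (lane == 0) {
        atomic_fetch_add_explicit(&tgSum, simdTotal, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // One lane per threadgroup writes out the finished threadgroup total.
    if (tid == 0) {
        tgTotals[tgIndex] = atomic_load_explicit(&tgSum, memory_order_relaxed);
    }
}
```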
Suppose that you use a grid (ie a set of threads larger than a threadgroup, spread both in time and across multiple GPU cores). Each core can calculate its local sum using the above two techniques, but we then have to sum the per-core sums across cores (and of course we have to be careful about synchronization). An atomic in threadgroup memory can't do that, because threadgroup memory is per-core (it uses the tile/core-local storage). What we need is storage that's shared across all the cores, which means device memory, effectively the GPU L2. The way we do that is to declare a device-scope atomic variable (eg atomic_uint or atomic<T>) and atomically add our per-core sum to that variable.
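In code it's the same pattern with only the last step swapped: instead of writing the threadgroup total to its own buffer slot, you add it into a single device-scope atomic (again a sketch; grandTotal here is a one-element buffer that the host zeroes before the dispatch):

```
#include <metal_stdlib>
using namespace metal;

// Same structure as the previous sketch; only the final step changes.
kernel void grid_sum(device const uint*  values     [[buffer(0)]],
                     device atomic_uint* grandTotal [[buffer(1)]],
                     uint gid  [[thread_position_in_grid]],
                     uint tid  [[thread_index_in_threadgroup]],
                     uint lane [[thread_index_in_simdgroup]])
{
    threadgroup atomic_uint tgSum;

    if (tid == 0) {
        atomic_store_explicit(&tgSum, 0u, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tier 1: within each SIMD group, in registers.
    uint simdTotal = simd_sum(values[gid]);

    // Tier 2: within the threadgroup, via core-local (threadgroup) memory.
    if (lane == 0) {
        atomic_fetch_add_explicit(&tgSum, simdTotal, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tier 3: across the whole grid, via a device-scope atomic (GPU L2).
    if (tid == 0) {
        uint tgTotal = atomic_load_explicit(&tgSum, memory_order_relaxed);
        atomic_fetch_add_explicit(grandTotal, tgTotal, memory_order_relaxed);
    }
}
```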
I know this is all overwhelming (and not helped by the fact that every GPU and GPU environment uses its own language). But I hope that this explanation of what's going on in the hardware helps.
The bottom line is that GPUs are made of SIMDs [groups of 32 execution lanes], SIMDs are organized into cores (multiple SIMDs sharing local storage), and an SoC has multiple GPU cores sharing a GPU L2. You COULD just use the GPU L2 and an atomic to do everything, to do every sum. But that would be overwhelmingly slow! Metal offers these tiers of different functionality (within-SIMD, within-core, and cross-core) so that you can do most of the work as fast, as parallel, and as low-energy as possible, only moving to a higher tier when you really need to, and doing as little work as possible at that higher tier.
Looking at the Alloy code (suggested by someone in this thread), it looks like they are trying to be super backwards compatible, and so are avoiding most of the conveniences introduced in Metal after 1.0. They seem to do a lot of things that are far from best practice in current Metal (which doesn't mean it's bad/dumb code, just that, as I said, they're trying to be very backward compatible).
I assume people reading this don't care about being compatible with anything before about an M1 and Metal 3.0, so I've tried to describe what to do for future-facing code.
You may want to look at
https://developer.apple.com/videos/play/tech-talks/10858
for an overview of what I've said. The whole talk is interesting, but starting from 17 minutes in you get an overview of grids, threadgroups, and SIMD groups, and a discussion of things like the new simd reduction instructions.