Question in short
Why am I getting undefined behavior from simd_min
and simd_max
functions in Metal 2.1 with floats?
Update: Seems this only occurs on the Radeon Pro 560X GPU, but not on the Intel UHD Graphics 630.
Background
According to Metal Shading Language Guide Section 5.14, simd_min
and simd_max
functions are supported for common scalar or vector, integer or floating-point types.
For simd_max
, the specification states:
T simd_max(T data)
Returns the maximum value in data across all active threads in the SIMD-group and broadcasts the result to all active threads in the SIMD-group.
Test Case
To test this, I am executing the following test kernel with an input buffer of 128 random floats in range 0..100:
kernel void simdMaxDebugKernel(
const device float *buffer [[ buffer(0) ]],
device float *output [[ buffer(1) ]],
uint id [[ thread_position_in_grid ]])
{
output[id] = simd_max(buffer[id]);
}
By inspection, the 128-valued buffer is divided into two 64-value SIMD groups. Therefore, I would expect that the first and last 64 values in the output would be set to the max value of the first and last SIMD groups respectively.
Test Results
I'm getting some unexpected results:
inputs [simd_float1] 128 values
[0] Float 94.3006362
[1] Float 98.1107177
[2] Float 85.3725891
[3] Float 45.1457863
...
[63] Float 36.5486336
[64] Float 56.5494308
[65] Float 45.6249847
[66] Float 34.8077431
actual [simd_float1] 128 values
[0] Float 94.3006362
[1] Float NaN
[2] Float -3.80461845E+20
[3] Float 0.0000000000000000000000000000000000000212763294
...
[63] Float 0
[64] Float 56.5494308
[65] Float -2467.3457
[66] Float 0.0000000000010178117
...
expectedMax simd_float1 99.4676971
Seems to me that the value at the first SIMD lane for each SIMD group is just copied, and the rest is undefined.
By contrast, the kernel behaves as expected, if a conversion to uint
is used as follows:
output[id] = (float)simd_max((uint)buffer[id]);
→
actual [simd_float1] 128 values
[0] Float 99
[1] Float 99
[2] Float 99
...
[63] Float 99
[64] Float 96
[65] Float 96
...
Test Configuration
- Mac OS 10.14.2 (18C54)
- MacBook Pro (15-inch, 2018)
- Radeon Pro 560X 4096 MB
- XCode Version 10.1 (10B61)