I'm attempting to optimize a compute shader that calculates some values from texture samples and uses atomic operations to increment counters in a buffer, very similar to the following answer:
https://stackoverflow.com/a/68076730/5510818
kernel void compute(texture2d<half, access::read> inTexture [[texture(0)]],
                    volatile device atomic_uint *samples [[buffer(0)]],
                    ushort2 position [[thread_position_in_grid]])
{
    // Early bail
    if ( position.x >= inTexture.get_width() || position.y >= inTexture.get_height() )
    {
        return;
    }
    half3 color = inTexture.read(position).rgb;
    // do some math here (this is where offset and somevalue are computed)
    // increment
    atomic_fetch_add_explicit( &( samples[offset] ), uint32_t( somevalue ), memory_order_relaxed );
}
And part of my encoder setup in Objective-C:
NSUInteger w = self.pass1PipelineState.threadExecutionWidth;
NSUInteger h = self.pass1PipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadsPerThreadGroup = MTLSizeMake(w, h, 1);
MTLSize threadsPerGrid = MTLSizeMake(frameMPSImage.width, frameMPSImage.height, 1);
[pass1Encoder dispatchThreads:threadsPerGrid threadsPerThreadgroup:threadsPerThreadGroup];
In an attempt to optimize, I am curious if I can leverage texture gather operations.
My understanding is that gather will fetch 4 samples 'about' the thread position in grid, and that it does so in an optimal manner. Am I right in understanding that I could in theory optimize this by fetching via gather, doing 4x the compute in my kernel, and writing out 4x from a single thread?
I would also have to make sure that the grid width and height I pass to the encoder are adjusted so that I don't duplicate work (i.e. divided by 4?).
Something like:
kernel void compute(texture2d<half, access::read> inTexture [[texture(0)]],
                    volatile device atomic_uint *samples [[buffer(0)]],
                    ushort2 position [[thread_position_in_grid]])
{
    // Early bail
    if ( position.x >= inTexture.get_width() || position.y >= inTexture.get_height() )
    {
        return;
    }
    // Pseudocode: this is how I imagine gather working, not a signature I have verified
    vec4<half3> colorGather = inTexture.gather(position).rgb;
    half3 color1 = colorGather[0];
    // do some math here
    half3 color2 = colorGather[1];
    // do some math here
    half3 color3 = colorGather[2];
    // do some math here
    half3 color4 = colorGather[3];
    // do some math here
    // increment 4x
    atomic_fetch_add_explicit( &( samples[offset1] ), uint32_t( somevalue1 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset2] ), uint32_t( somevalue2 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset3] ), uint32_t( somevalue3 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset4] ), uint32_t( somevalue4 ), memory_order_relaxed );
}
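On the encoder side, here is a rough sketch of the adjustment I have in mind. It assumes each thread ends up handling one 2x2 block of texels (that mapping is my assumption), in which case the grid would shrink by half in each dimension, rounded up, for a quarter of the threads overall. GridSize and gridForBlocks2x2 are hypothetical helper names, written in plain C so they could sit next to the Objective-C code:

```c
#include <stddef.h>

/* Hypothetical helper type: grid dimensions, in threads. */
typedef struct {
    size_t width;
    size_t height;
} GridSize;

/* Assumes one thread per 2x2 texel block.
   Rounds up so that textures with an odd width or height are still fully covered. */
static GridSize gridForBlocks2x2(size_t textureWidth, size_t textureHeight)
{
    GridSize grid;
    grid.width  = (textureWidth  + 1) / 2;
    grid.height = (textureHeight + 1) / 2;
    return grid;
}
```

The result would replace frameMPSImage.width and frameMPSImage.height in the MTLSizeMake call for threadsPerGrid, and the early-bail check in the kernel would presumably need to compare position * 2 against the texture size instead of position.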
Am I understanding gather correctly?
Are there any publicly available examples of gather? I cannot seem to find any!
Is there a way to take a single mutex-style lock on the buffer so that I am not locking 4x in the above code?
Am I right that I would need to adjust my Objective-C encoder pass to account for the fact that I'd be sampling 4x in the shader?
Thank you.