Optimizing Metal Compute - texture sampling using Gather?

Question

I'm attempting to optimize a compute shader that calculates some values from texture samples and uses atomic operations to increment counters in a buffer, very similar to the following answer:

https://stackoverflow.com/a/68076730/5510818

kernel void compute(texture2d<half, access::read>  inTexture  [[texture(0)]],
                    volatile device atomic_uint *samples [[buffer(0)]],
                    ushort2 position [[thread_position_in_grid]])
{
    // Early bail: skip threads that fall outside the texture
    if ( position.x >= inTexture.get_width() || position.y >= inTexture.get_height() )
    {
        return;
    }

    half3 color = inTexture.read(position).rgb;
    // do some math here (this is where offset and somevalue are derived from color)
    // increment
    atomic_fetch_add_explicit( &( samples[offset] ), uint32_t( somevalue ), memory_order_relaxed );
}

And part of my encoder setup in Objective-C:

NSUInteger w  = self.pass1PipelineState.threadExecutionWidth;
NSUInteger h  = self.pass1PipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadsPerThreadGroup = MTLSizeMake(w, h, 1);

MTLSize threadsPerGrid = MTLSizeMake(frameMPSImage.width, frameMPSImage.height, 1);

[pass1Encoder dispatchThreads:threadsPerGrid threadsPerThreadgroup:threadsPerThreadGroup];

In an attempt to optimize, I am curious if I can leverage texture gather operations.

My understanding is that gather fetches 4 samples around the thread position in the grid, and that it does so in an optimal manner. Am I right that I could, in theory, optimize this by fetching via gather, doing 4x the compute in my kernel, and writing out 4x from a single thread?

I would also have to make sure that the grid width and height I pass to the encoder don't cause duplicated work (i.e. divide by 4?).

Something like:

kernel void compute(texture2d<half, access::read>  inTexture  [[texture(0)]],
                    volatile device atomic_uint *samples [[buffer(0)]],
                    ushort2 position [[thread_position_in_grid]])
{
    // Early bail: skip threads that fall outside the texture
    if ( position.x >= inTexture.get_width() || position.y >= inTexture.get_height() )
    {
        return;
    }

    // Pseudocode: what I imagine gather doing, i.e. returning four RGB texels from one fetch
    vec4<half3> colorGather = inTexture.gather(position).rgb;
    half3 color1 = colorGather[0];
    // do some math here
    half3 color2 = colorGather[1];
    // do some math here
    half3 color3 = colorGather[2];
    // do some math here
    half3 color4 = colorGather[3];
    // do some math here

    // increment 4x
    atomic_fetch_add_explicit( &( samples[offset1] ), uint32_t( somevalue1 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset2] ), uint32_t( somevalue2 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset3] ), uint32_t( somevalue3 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset4] ), uint32_t( somevalue4 ), memory_order_relaxed );
}

Am I understanding gather correctly?
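
For reference, here is a minimal sketch of how a 2x2-per-thread version might look, going by the `gather` signature in the Metal Shading Language specification. As far as I can tell, `gather` needs a texture declared with `access::sample`, takes a sampler and a floating-point coordinate, and returns one channel of the four neighbouring texels per call, so RGB would take three calls. The kernel name `computeGather`, the luminance binning, and the 256-bin layout are hypothetical stand-ins for the elided math. The sketch also assumes even texture dimensions and has not been profiled.

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch only. Assumes even texture dimensions and a grid of
// (width / 2, height / 2) threads, so each thread owns one 2x2 block.
kernel void computeGather(texture2d<half, access::sample> inTexture [[texture(0)]],
                          volatile device atomic_uint *samples [[buffer(0)]],
                          ushort2 position [[thread_position_in_grid]])
{
    constexpr sampler s(coord::normalized, address::clamp_to_edge, filter::nearest);

    // Top-left texel of the 2x2 block handled by this thread.
    const uint2 block = uint2(position) * 2;
    if (block.x >= inTexture.get_width() || block.y >= inTexture.get_height()) {
        return;
    }

    // Sample at the corner shared by the four texels of the block.
    const float2 size = float2(inTexture.get_width(), inTexture.get_height());
    const float2 uv = (float2(block) + 1.0f) / size;

    // One gather per channel; each call returns that channel for all four texels.
    const half4 r = inTexture.gather(s, uv, int2(0), component::x);
    const half4 g = inTexture.gather(s, uv, int2(0), component::y);
    const half4 b = inTexture.gather(s, uv, int2(0), component::z);

    for (uint i = 0; i < 4; ++i) {
        const half3 color = half3(r[i], g[i], b[i]);
        // Hypothetical placeholder for "do some math here": bin by luminance into 256 bins.
        const half luma = clamp(dot(color, half3(0.2126h, 0.7152h, 0.0722h)), 0.0h, 1.0h);
        const uint bin = uint(luma * 255.0h);
        atomic_fetch_add_explicit(&samples[bin], 1u, memory_order_relaxed);
    }
}
```

This still issues four atomic adds per thread. The only thing it changes is the number of texture fetch instructions, which is three gathers instead of four reads per 2x2 block.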

Are there any publicly available examples of gather? I cannot seem to find any!

Is there a way to take a single mutex-style lock on the buffer so that I am not performing 4 separate atomic operations in the code above?

Am I correct that I need to adjust my Objective-C encoder pass to account for the fact that each thread would be handling 4 texels in the shader?
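
To make the encoder question concrete: if each thread covers a 2x2 block, I would expect the grid to be halved in each dimension (a quarter of the threads overall) rather than each dimension being divided by 4. A minimal sketch, where `GridSizeFor2x2Blocks` is a hypothetical helper name and not part of Metal:

```objc
#import <Metal/Metal.h>

// Hypothetical helper: grid size when each thread covers a 2x2 block of texels.
static MTLSize GridSizeFor2x2Blocks(NSUInteger textureWidth, NSUInteger textureHeight)
{
    // Round up so an odd trailing row or column still gets a thread;
    // the kernel would then have to skip the texels that fall outside the texture.
    return MTLSizeMake((textureWidth + 1) / 2, (textureHeight + 1) / 2, 1);
}
```

It would replace the `threadsPerGrid` line above with `MTLSize threadsPerGrid = GridSizeFor2x2Blocks(frameMPSImage.width, frameMPSImage.height);`, leaving `threadsPerThreadGroup` as it is.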

Thank you.
