I want to improve the performance of a compute shader.
Each thread group of the shader needs 8 blocks of data, and each block has 24 elements.
I’m primarily optimizing for the GeForce 1080Ti in my development PC and the Tesla V100 in the production servers, but other people also run this code on their workstations, where the GPUs vary and are not necessarily nVidia.
Which way is better:
Option 1: [numthreads( 24, 1, 1 )] with a loop for( uint i = 0; i < 8; i++ ).

This wastes 25% of the execution units in each warp (only 24 of the 32 lanes are active), but the memory access pattern is awesome: the VRAM reads of these 24 active threads are either coalesced or full broadcasts.
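Here is a minimal sketch of option 1. It assumes a flat StructuredBuffer<float> input where each thread group reads 8 consecutive blocks of 24 floats and stages them in group shared memory; the buffer name, the layout, and the groupshared staging are assumptions for illustration, not the actual shader:

```hlsl
// Sketch only: inputData, the flat layout, and the groupshared staging are assumed.
StructuredBuffer<float> inputData : register( t0 );

static const uint BLOCK_SIZE = 24;        // elements per block
static const uint BLOCKS_PER_GROUP = 8;   // blocks per thread group

groupshared float blocks[ BLOCKS_PER_GROUP * BLOCK_SIZE ];

[numthreads( 24, 1, 1 )]
void main( uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID )
{
    const uint groupBase = groupID.x * BLOCKS_PER_GROUP * BLOCK_SIZE;

    // Each of the 24 active threads loads one element per block; on every
    // iteration the single warp reads 24 consecutive floats, which coalesces.
    for( uint i = 0; i < BLOCKS_PER_GROUP; i++ )
    {
        const uint idx = i * BLOCK_SIZE + groupThreadID.x;
        blocks[ idx ] = inputData[ groupBase + idx ];
    }

    GroupMemoryBarrierWithGroupSync();

    // ...the actual computation on the 8 staged blocks goes here...
}
```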
Option 2: [numthreads( 96, 1, 1 )] with a loop for( uint i = groupThreadID / 24; i < 8; i += 4 ).

This looks better in terms of execution unit utilization, but the VRAM access pattern becomes worse because each warp now reads 2 slices of the input data. I’m also worried about the synchronization penalty of the GroupMemoryBarrierWithGroupSync() intrinsic, because the group shared memory becomes split over 3 warps. It’s also a bit harder to implement.
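And a corresponding sketch of option 2, under the same assumed buffer layout and staging. The 96 threads form 4 slices of 24 threads each, and slice s loads blocks s and s + 4:

```hlsl
// Sketch only: same assumed buffer layout and groupshared staging as above.
StructuredBuffer<float> inputData : register( t0 );

static const uint BLOCK_SIZE = 24;        // elements per block
static const uint BLOCKS_PER_GROUP = 8;   // blocks per thread group
static const uint SLICES = 4;             // 96 threads / 24 threads per slice

groupshared float blocks[ BLOCKS_PER_GROUP * BLOCK_SIZE ];

[numthreads( 96, 1, 1 )]
void main( uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID )
{
    const uint groupBase = groupID.x * BLOCKS_PER_GROUP * BLOCK_SIZE;
    const uint slice = groupThreadID.x / BLOCK_SIZE;   // 0..3
    const uint lane  = groupThreadID.x % BLOCK_SIZE;   // 0..23 within the slice

    // Each slice of 24 threads loads 2 of the 8 blocks. A 32-thread warp now
    // straddles two slices, so its reads touch two separate 24-float ranges.
    for( uint i = slice; i < BLOCKS_PER_GROUP; i += SLICES )
    {
        const uint idx = i * BLOCK_SIZE + lane;
        blocks[ idx ] = inputData[ groupBase + idx ];
    }

    // The barrier now has to synchronize 3 warps (96 threads) instead of 1.
    GroupMemoryBarrierWithGroupSync();

    // ...the actual computation on the 8 staged blocks goes here...
}
```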