I want to improve the performance of a compute shader.
Each thread group of the shader needs 8 blocks of data, and each block has 24 elements.
I’m primarily optimizing for the GeForce 1080Ti in my development PC and the Tesla V100 in the production servers, but other people also run this code on their workstations, where the GPUs vary and are not necessarily nVidia.
Which way is better:
Option 1: [numthreads( 24, 1, 1 )] with a loop for( uint i = 0; i < 8; i++ ).

This wastes 25% of the execution units in each warp (only 24 of the 32 lanes are active), but the memory access pattern is awesome: the VRAM reads of these 24 active threads are either coalesced or full broadcasts.
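Here is a minimal sketch of option 1. It assumes a flat StructuredBuffer<float> input where each thread group reads 8 consecutive blocks of 24 floats and stages them in group shared memory; the buffer name, the layout, and the groupshared staging are assumptions for illustration, not the actual shader:

```hlsl
// Sketch only: inputData, the flat layout, and the groupshared staging are assumed.
StructuredBuffer<float> inputData : register( t0 );

static const uint BLOCK_SIZE = 24;        // elements per block
static const uint BLOCKS_PER_GROUP = 8;   // blocks per thread group

groupshared float blocks[ BLOCKS_PER_GROUP * BLOCK_SIZE ];

[numthreads( 24, 1, 1 )]
void main( uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID )
{
    const uint groupBase = groupID.x * BLOCKS_PER_GROUP * BLOCK_SIZE;

    // Each of the 24 active threads loads one element per block; on every
    // iteration the single warp reads 24 consecutive floats, which coalesces.
    for( uint i = 0; i < BLOCKS_PER_GROUP; i++ )
    {
        const uint idx = i * BLOCK_SIZE + groupThreadID.x;
        blocks[ idx ] = inputData[ groupBase + idx ];
    }

    GroupMemoryBarrierWithGroupSync();

    // ...the actual computation on the 8 staged blocks goes here...
}
```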
Option 2: [numthreads( 96, 1, 1 )] with a loop for( uint i = groupThreadID / 24; i < 8; i += 4 ).

This looks better in terms of execution unit utilization, but the VRAM access pattern becomes worse because each warp now reads 2 slices of the input data. I’m also worried about the synchronization penalty of the GroupMemoryBarrierWithGroupSync() intrinsic, because the group shared memory becomes split over 3 warps. It’s also a bit harder to implement.
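And a corresponding sketch of option 2, under the same assumed buffer layout and staging. The 96 threads form 4 slices of 24 threads each, and slice s loads blocks s and s + 4:

```hlsl
// Sketch only: same assumed buffer layout and groupshared staging as above.
StructuredBuffer<float> inputData : register( t0 );

static const uint BLOCK_SIZE = 24;        // elements per block
static const uint BLOCKS_PER_GROUP = 8;   // blocks per thread group
static const uint SLICES = 4;             // 96 threads / 24 threads per slice

groupshared float blocks[ BLOCKS_PER_GROUP * BLOCK_SIZE ];

[numthreads( 96, 1, 1 )]
void main( uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID )
{
    const uint groupBase = groupID.x * BLOCKS_PER_GROUP * BLOCK_SIZE;
    const uint slice = groupThreadID.x / BLOCK_SIZE;   // 0..3
    const uint lane  = groupThreadID.x % BLOCK_SIZE;   // 0..23 within the slice

    // Each slice of 24 threads loads 2 of the 8 blocks. A 32-thread warp now
    // straddles two slices, so its reads touch two separate 24-float ranges.
    for( uint i = slice; i < BLOCKS_PER_GROUP; i += SLICES )
    {
        const uint idx = i * BLOCK_SIZE + lane;
        blocks[ idx ] = inputData[ groupBase + idx ];
    }

    // The barrier now has to synchronize 3 warps (96 threads) instead of 1.
    GroupMemoryBarrierWithGroupSync();

    // ...the actual computation on the 8 staged blocks goes here...
}
```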