
I am wondering whether some kind of optimization takes place with HLSL InterlockedAdd, specifically when it is used on a single global atomic counter (with the added value constant across all threads) by a large number of threads.

Some information I dug up on the web says that atomic adds can create significant contention issues: https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/

Granted, the article above is written for CUDA (and is also a little old, dating to 2014), whereas I am interested in HLSL InterlockedAdd. To test this, I wrote a dummy HLSL shader for Unity (compiled to D3D11 via FXC, to my knowledge), in which I call InterlockedAdd on a single global atomic counter such that the added value is the same across all shaded fragments. Here is the snippet in question (run in http://shader-playground.timjones.io/, compiled via FXC at optimization level 3, shader model 5.0):

**HLSL**:
RWStructuredBuffer<int> counter : register(u1);

// The added value is the same constant for every fragment.
void PSMain()
{
    InterlockedAdd(counter[0], 1);
}
----
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
atomic_iadd u1, l(0, 0, 0, 0), l(1)
ret 

I then slightly modified the code: instead of always adding a constant value, I now add a value that varies across fragments, like so:

**HLSL**:
RWStructuredBuffer<int> counter : register(u1);

// The added value now varies per fragment.
void PSMain(float4 pixel_pos : SV_Position)
{
    InterlockedAdd(counter[0], int(pixel_pos.x));
}
----
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
dcl_input_ps_siv linear noperspective v0.x, position
dcl_temps 1
ftoi r0.x, v0.x
atomic_iadd u1, l(0, 0, 0, 0), r0.x
ret 

I implemented the equivalents of the aforementioned snippets in Unity and used them as fragment shaders for rendering a full-screen quad (granted, there are no output semantics, but that is irrelevant here). I profiled the resulting shaders with Nsight Graphics. Suffice it to say that the difference between the two draw calls was massive, with the fragment shader based on the second snippet (InterlockedAdd with a variable value) being considerably slower.

I also made captures with RenderDoc to check the assembly, and it looks identical to what is shown above. Nothing in the assembly suggests such a dramatic difference, and yet the difference is there.

So my question is: is there some kind of optimization taking place when HLSL InterlockedAdd is used on a single global atomic counter with a constant added value? Is it, perhaps, possible that the GPU driver somehow rearranges the code?

System specs:

  • NVIDIA Quadro P4000
  • Windows 10
  • Unity 2019.4
haykoandri

1 Answer


The pixel shader on the GPU runs pixels in SIMD groups, called wavefronts. If the currently executing code does not change based on which pixel is being rendered, the code only has to be run once for the entire group. If it does change per pixel, then each of the pixels needs to run its own code.

In the first version, a 64-pixel wavefront would execute the code as a single SIMD `InterlockedAdd<64>(counter[0], 1);`, or might even optimize it into `InterlockedAdd(counter[0], 64);`. In the second example it turns into a series of serial, non-SIMD adds and becomes 64 times as expensive. A hand-written version of this folding is sketched below.
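To make the constant-value folding concrete, here is a minimal sketch of what it could look like if written out by hand. Note that this is illustrative only: it relies on Shader Model 6.0 wave intrinsics (WaveActiveCountBits, WaveIsFirstLane), which require compiling with DXC rather than the FXC / shader model 5.0 path used in the question, and it assumes the wave size matches the driver's aggregation granularity.

**HLSL**:
RWStructuredBuffer<int> counter : register(u1);

void PSMain()
{
    // Each active lane wants to add 1; fold the whole wave's
    // contribution into a single atomic issued by one lane.
    uint activeLanes = WaveActiveCountBits(true);
    if (WaveIsFirstLane())
    {
        InterlockedAdd(counter[0], (int)activeLanes);
    }
}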

All of this is an oversimplification, and there are other tricks the GPU uses to share computing resources. But a good general rule of thumb is to make as much code as possible shareable by every nearby pixel.
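For the variable-value case, the same aggregation can be applied manually, which is essentially the HLSL analogue of the warp-aggregated atomics described in the CUDA article linked in the question. Again, this is a sketch assuming Shader Model 6.0 wave intrinsics (WaveActiveSum), not something the FXC path in the question would emit:

**HLSL**:
RWStructuredBuffer<int> counter : register(u1);

void PSMain(float4 pixel_pos : SV_Position)
{
    int value = int(pixel_pos.x);

    // Sum the per-lane values across the wave once...
    int waveTotal = WaveActiveSum(value);

    // ...then let a single lane issue one atomic instead of up to 64.
    if (WaveIsFirstLane())
    {
        InterlockedAdd(counter[0], waveTotal);
    }
}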

George Davison