Does anyone know whether some kind of optimization is applied to HLSL InterlockedAdd, specifically when a large number of threads use it on a single global atomic counter and the added value is the same constant across all threads?
Some information I dug up on the web says that atomic adds can create significant contention issues: https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
Granted, the article above is written for CUDA (and is a little old, dating to 2014), whereas I am interested in HLSL InterlockedAdd. To that end, I wrote a dummy HLSL shader for Unity (compiled to d3d11 via FXC, to my knowledge), in which I call InterlockedAdd on a single global atomic counter with an added value that is always the same across all shaded fragments. The snippet in question (run in http://shader-playground.timjones.io/, compiled via FXC, optimization level 3, shader model 5.0):
**HLSL**:
RWStructuredBuffer<int> counter : register(u1);
void PSMain()
{
    InterlockedAdd(counter[0], 1);
}
----
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
atomic_iadd u1, l(0, 0, 0, 0), l(1)
ret
I then modified the code slightly so that, instead of always adding a constant, it adds a value that varies across fragments:
**HLSL**:
RWStructuredBuffer<int> counter : register(u1);
void PSMain(float4 pixel_pos : SV_Position)
{
    InterlockedAdd(counter[0], int(pixel_pos.x));
}
----
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
dcl_input_ps_siv linear noperspective v0.x, position
dcl_temps 1
ftoi r0.x, v0.x
atomic_iadd u1, l(0, 0, 0, 0), r0.x
ret
I implemented the equivalents of the two snippets above in Unity and used them as fragment shaders for rendering a full-screen quad (granted, there are no output semantics, but that is irrelevant here). I profiled the resulting shaders with Nsight Graphics. The difference between the two draw calls was massive: the fragment shader based on the second snippet (InterlockedAdd with a variable value) was considerably slower.
I also made captures with RenderDoc to check the assembly, and it looks identical to what is shown above. Nothing in the assembly code suggests such a dramatic difference. And yet, the difference is there.
So my question is: is there some kind of optimization taking place when using HLSL InterlockedAdd on a single global atomic counter, such that the added value is a constant? Is it, perhaps, possible that the GPU driver can somehow rearrange the code?
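To make clear what kind of rearrangement I have in mind, here is a rough sketch of the warp-aggregated pattern from the NVIDIA article, written by hand in HLSL. This is only an illustration, not something I ran: it assumes Shader Model 6.0 wave intrinsics (so DXC rather than FXC and ps_5_0), and I have no evidence that the driver does anything like this internally.

```hlsl
// Hypothetical hand-written wave aggregation (assumes Shader Model 6.0 / DXC).
// Not the shader I profiled; shown only to illustrate the idea.
RWStructuredBuffer<int> counter : register(u1);

void PSMain(float4 pixel_pos : SV_Position)
{
    int value = int(pixel_pos.x);

    // Sum the per-lane values across all active lanes of the wave.
    int wave_sum = WaveActiveSum(value);

    // Only one lane per wave issues the atomic on the global counter.
    if (WaveIsFirstLane())
    {
        InterlockedAdd(counter[0], wave_sum);
    }
}
```

With a constant added value, the per-wave sum would reduce to the constant times the number of active lanes, which is why I suspect that case may be easier for the hardware or driver to collapse.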
System specs:
- NVIDIA Quadro P4000
- Windows 10
- Unity 2019.4