
I'm going through the parallel reduction example from Nvidia. For tid < 32 the threads are all meant to be in the same warp, so the instructions are supposed to execute SIMD-synchronously; we should therefore be able to assume that sdata[tid] += sdata[tid + 32]; completes for all threads before sdata[tid] += sdata[tid + 16]; begins, and so on. But this is not happening for me.

// Reduce with a group-wide barrier until only the last warp remains.
for (unsigned int s = groupDim_x / 2; s > 32; s >>= 1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    GroupMemoryBarrierWithGroupSync();
}

// Last warp: no barriers, relying on SIMD-synchronous execution.
if (tid < 32)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

The solution to the same problem in CUDA has already been posted, but it uses pointers and the volatile keyword. DirectCompute doesn't have pointers and doesn't allow the volatile keyword on global memory.
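For completeness, a fully barriered version sidesteps the warp-synchronous assumption entirely at the cost of extra syncs, but I'm after the faster warp-level finish. A minimal sketch, using the same groupDim_x, tid and sdata as above:

// Barrier-every-step reduction: correct regardless of warp size,
// at the cost of extra synchronization.
for (unsigned int s = groupDim_x / 2; s > 0; s >>= 1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    // Every thread in the group reaches this barrier, because the
    // loop bounds do not depend on tid.
    GroupMemoryBarrierWithGroupSync();
}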

Tom Huntington
  • What hardware are you using? Note that the Nvidia example assumes a GPU with a warp size of at least 32 - which is true for most hardware, but not guaranteed for all. Intel's integrated GPUs in particular tend to have warp sizes of 4 (at least the ones I have tested so far) – Bizzarrus Sep 13 '20 at 15:01
  • Sorry, I should have mentioned that: Nvidia. – Tom Huntington Sep 17 '20 at 05:48

1 Answer


DirectCompute doesn't have pointers and doesn't allow the volatile keyword on global memory.

Indeed, but it exposes comparable functionality as intrinsic functions. Replace the += in your loop with the InterlockedAdd intrinsic and see what happens. However, that function only works on integers.
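A minimal sketch of that suggestion, assuming sdata is redeclared as a groupshared int array (the interlocked intrinsics don't operate on floats) and a placeholder group size of 64:

#define GROUP_SIZE 64 // placeholder; must match your actual dispatch size

groupshared int sdata[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint tid : SV_GroupIndex)
{
    // ... load input into sdata[tid], barrier, and run the outer loop as before ...

    if (tid < 32)
    {
        // Each InterlockedAdd is an atomic read-modify-write on groupshared
        // memory, so the partial sums can't be cached in registers between steps.
        InterlockedAdd(sdata[tid], sdata[tid + 32]);
        InterlockedAdd(sdata[tid], sdata[tid + 16]);
        InterlockedAdd(sdata[tid], sdata[tid +  8]);
        InterlockedAdd(sdata[tid], sdata[tid +  4]);
        InterlockedAdd(sdata[tid], sdata[tid +  2]);
        InterlockedAdd(sdata[tid], sdata[tid +  1]);
    }
}

Within a single warp the atomicity is stronger than strictly needed; the useful property is that every partial sum goes through groupshared memory rather than registers, which is roughly what volatile provided in the CUDA version.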

Soonts