
A follow-up question to: EarlyExit and DroppedThreads

According to the linked questions above, the code below should deadlock.
Please explain why it does NOT deadlock. (CUDA 5 on a Fermi GPU)

__device__ int data[144];  // renamed from "add" so it doesn't clash with the kernel name
__device__ int result;

__global__ void add() {
  int thrdIdx = threadIdx.x;
  for (int idx = 72 >> 1; idx > 0; idx >>= 1) {
    if (thrdIdx < idx)
      data[thrdIdx] += data[thrdIdx + idx];
    else
      return;            // threads at or above the stride exit early
    __syncthreads();
  }

  if (thrdIdx == 0)
    result = data[0];
}

// called from the host as: add<<<1,96>>>();
  • What's your launch configuration? (e.g. block && grid dimensions) – alrikai Feb 28 '13 at 22:40
  • The code you've given wouldn't come close to compiling, much less deadlock. By the way, deadlock due to improper use of syncthreads is a *possibility*, not a *guarantee*. The correct way to approach improper use of syncthreads is to conclude that the behavior is [undefined](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions). – Robert Crovella Feb 28 '13 at 22:42

1 Answer


This is technically an ill-defined program.

Most NVIDIA GPUs (though not all; G80, for example, does not) support early exit in this way, because the hardware maintains an active thread count for each block, and this count, rather than the block's initial thread count, is used for barrier synchronization.

Therefore, when the __syncthreads() in your code is reached, the hardware will not wait on any threads that have already returned, and the program runs without deadlock.

A more common use of this style is:

__global__ void foo(int n, ...) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx >= n) return;  // out-of-range threads exit before any barrier
  ... // do some computation with remaining threads
}

Important note: barrier counts are updated per-warp (see here), not per-thread. So you may have a case where only some (or even zero) threads of a warp return early; that warp still counts as active, so the barrier's expected arrival count is not decremented. However, as long as at least one thread from each active warp reaches the barrier, it will not deadlock.
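
To make the per-warp behavior concrete, here is a minimal sketch (the kernel partial_exit and its launch shape are my own illustration, not from the question) in which the upper half of every warp returns before the barrier. Under the accounting described above, the surviving half-warps still register their warps' arrival, so the barrier is satisfied; keep in mind this is still formally undefined behavior per the programming guide.

__global__ void partial_exit(int *out) {
  int lane = threadIdx.x % 32;  // lane index within the warp
  if (lane >= 16)
    return;                     // upper half of every warp exits early
  __syncthreads();              // arrival is counted per-warp, so the
                                // surviving half-warps satisfy the barrier
  out[threadIdx.x] = lane;
}

// launched e.g. as: partial_exit<<<1, 96>>>(d_out);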

So in general, you need to use barriers carefully. But specifically, (simple) early exit patterns like this do work.

Edit: for your specific case.

Iteration idx==36: threads 36–95 take the else branch and return, so warp 2 (threads 64–95) retires entirely; 2 warps remain active, so the barrier exit count is 64. All 32 threads of warp 0 reach the barrier, incrementing the count from 0 to 32. The 4 surviving threads of warp 1 (threads 32–35) reach the barrier, incrementing the count from 32 to 64 (the count is updated per-warp, so warp 1's arrival adds a full 32), and warps 0 and 1 are released from the barrier. Read the link above to understand why this happens.

Iteration idx==18: threads 18–35 return, so warp 1 retires; 1 warp remains active and the barrier exit count is 32. The 18 surviving threads of warp 0 reach the barrier, incrementing the count from 0 to 32 (again, a full warp's worth). The barrier is satisfied and warp 0 is released.

And so on for idx == 9, 4, 2, and 1, each with a single active warp.
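
For completeness, here is a small host-side sketch (mine, not part of the original answer) that tabulates every iteration under this rule: arriving threads are those with thrdIdx < idx, active warps are those with at least one arriving thread, and the barrier exit count is 32 times the active-warp count.

#include <stdio.h>

int main(void) {
  int alive = 96;                                // threads launched: add<<<1,96>>>()
  for (int idx = 72 >> 1; idx > 0; idx >>= 1) {
    int arriving = idx < alive ? idx : alive;    // threads 0..idx-1 reach the barrier
    int warps = (arriving + 31) / 32;            // warps with at least one live thread
    printf("idx=%2d: %2d arriving, %d active warp(s), barrier exit count %d\n",
           idx, arriving, warps, 32 * warps);
    alive = arriving;                            // threads >= idx returned this pass
  }
  return 0;
}

This prints exit counts of 64 for idx==36 and 32 for every later iteration, matching the walk-through above.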

  • Don't get caught up with what the func is doing, but look at how it is doing it. Other Q's on this subject seem to imply "Oh, you can't do that". My experience says otherwise. This does appear to work reliably, not undefined. I'm trying to understand why, so I can make better use of it. The #'s used above start w/ 2 warps and quickly drop to one; thrds are dropped along the way. The thrds in the 2nd warp all see early returns, so they don't see the barrier while the 1st warp is still running. This doesn't seem to agree with your "it will not deadlock" comment. Can you elaborate? – Doug Mar 01 '13 at 15:59
  • Yes, the active threads count, (those that don't return early), and the barrier count are 2 different #'s. You answer from a different point of view: As long as there is at least one active thread in each active warp, it will not deadlock. – Doug Mar 04 '13 at 16:22
  • This would imply a reduction that operates within a single warp would also not deadlock, i.e. start w/ up to 32 thrds in each warp doing work, eventually reducing those 32 thrds down to at least a single thread in each warp still doing work. – Doug Mar 04 '13 at 16:26
  • Yes, and that "warp-synchronous" approach is very common. [See my old presentation on the subject](http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf). – harrism Mar 06 '13 at 00:27
  • Please update the presentation link. It brings me to docs.nvidia.com/cuda/index.html I'm not sure where to find it from there. – Doug Mar 06 '13 at 23:05
  • That link didn't work for me. I think this is it: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf – shoelzer Mar 14 '13 at 20:16
  • I think you must be correct, but I am struggling to reconcile your answer with others, namely [this one](http://stackoverflow.com/a/6667067/2778484) and the PTX guide excerpt shown there. Are _active_ threads really the important factor here? – chappjc May 20 '15 at 22:43
  • The reason I considered this subject beyond the official documentation is your statement "So in general, you need to use barriers carefully. But specifically, (simple) early exit patterns like this do work." Do you recommend against doing what you have shown? Thanks for clarifying. – chappjc May 25 '15 at 04:53
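
For readers following the last few comments: the "warp-synchronous" final step being discussed looks roughly like the sketch below, written in the style of the linked reduction presentation (warp_reduce, sdata, and tid are illustrative names, not from this page). The volatile qualifier prevents the compiler from caching shared-memory reads in registers. Note that this pattern relied on Fermi-era implicit warp lockstep; on Volta and later GPUs it requires explicit __syncwarp() or warp shuffles.

// Warp-synchronous final reduction: once only the first 32 threads remain,
// no __syncthreads() is used because every participant is in one warp
// (a Fermi-era assumption, as discussed in the comments above).
__device__ void warp_reduce(volatile int *sdata, int tid) {
  sdata[tid] += sdata[tid + 32];  // assumes sdata holds at least 64 elements
  sdata[tid] += sdata[tid + 16];
  sdata[tid] += sdata[tid +  8];
  sdata[tid] += sdata[tid +  4];
  sdata[tid] += sdata[tid +  2];
  sdata[tid] += sdata[tid +  1];
}

// In the enclosing kernel, once the shared-memory reduction is down to 64 values:
//     if (tid < 32) warp_reduce(sdata, tid);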