
I am wondering if the warp scheduling order of a CUDA application is deterministic.

Specifically, I am wondering if the ordering of warp execution will stay the same across multiple runs of the same kernel with the same input data on the same device. If not, is there anything that could force the ordering of warp execution (say, when debugging an order-dependent algorithm)?

NothingMore
  • Personally I think, if the kernel has enough warps, there is a part of them which is deterministic while the other part is not. By deterministic I mean that if you run multiple times, the scheduling order won't change. On a number of NVIDIA GPU devices I know, the first part comprises the first set of active warps. Just modify the PTX code (with %clock and %smid) and you will come up with the exact scheduling order. Of course this is just a practical hack and I'm not sure it will work in your case, but it might be fun to give it a try. – Zk1001 Jan 19 '15 at 06:34
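
A minimal sketch of the kind of measurement Zk1001 describes might look like the following, assuming inline PTX is used to read the %smid special register and clock64() provides a per-SM timestamp (the kernel and variable names here are placeholders). This only observes when each warp started executing; it does not control the order, and clock values are only comparable between warps running on the same SM (a single block, as here, stays on one SM):

#include <stdio.h>

#define N_WARPS 16
#define nTPB (32*N_WARPS)

__device__ unsigned long long warp_clock[N_WARPS];  // cycle count at each warp's entry
__device__ unsigned warp_sm[N_WARPS];               // SM each warp ran on

__device__ unsigned get_smid(){
  unsigned smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid)); // read the %smid special register
  return smid;
}

__global__ void log_warp_start(){

  unsigned my_warpid = threadIdx.x >> 5;   // warp index within the block
  if ((threadIdx.x & 0x01F) == 0){         // one lane per warp records its entry
    warp_clock[my_warpid] = clock64();     // per-SM cycle counter
    warp_sm[my_warpid] = get_smid();
    }
}

int main(){

  unsigned long long h_clock[N_WARPS];
  unsigned h_sm[N_WARPS];
  log_warp_start<<<1,nTPB>>>();
  cudaDeviceSynchronize();
  cudaMemcpyFromSymbol(h_clock, warp_clock, N_WARPS*sizeof(unsigned long long));
  cudaMemcpyFromSymbol(h_sm, warp_sm, N_WARPS*sizeof(unsigned));
  for (int i = 0; i < N_WARPS; i++) printf("warp %2d: sm %u, start cycle %llu\n", i, h_sm[i], h_clock[i]);
  return 0;
}

Sorting the warps by their recorded start cycle gives the observed issue order for that particular run.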

1 Answer


The precise behavior of CUDA warp scheduling is not defined. Therefore you cannot depend on it being deterministic. In particular, if multiple warps are ready to be executed in a given issue slot, there is no description of which warp will be selected by the warp scheduler(s).

There is no external method to precisely control the order of warp execution.

It's certainly possible to build code that determines warp ID and forces warps to execute in a particular order. Something like this:

#include <stdio.h>

#define N_WARPS 16
#define nTPB (32*N_WARPS)

__device__ volatile int my_next = 0;   // next slot to fill in the ordering log
__device__ int warp_order[N_WARPS];    // records the order in which warps ran

__global__ void my_kernel(){

  __shared__ volatile int warp_num;                  // warp currently allowed to proceed
  unsigned my_warpid = (threadIdx.x & 0x0FE0U)>>5;   // warp index within the block
  if (!threadIdx.x) warp_num = 0;
  __syncthreads();  // don't use syncthreads() after this point
  while (warp_num != my_warpid);  // spin until it is this warp's turn
  // warp specific code here
  if ((threadIdx.x & 0x01F) == 0){  // lane 0 of each warp logs its turn
    warp_order[my_next++] = my_warpid;
    __threadfence();
    warp_num++; // release next warp
    } // could use syncthreads() after this point, if more code follows
}


int main(){

  int h_warp_order[N_WARPS];
  for (int i = 0; i < N_WARPS; i++) h_warp_order[i] = -1;
  cudaMemcpyToSymbol(warp_order, h_warp_order, N_WARPS*sizeof(int));
  my_kernel<<<1,nTPB>>>();
  cudaDeviceSynchronize();
  cudaMemcpyFromSymbol(h_warp_order, warp_order, N_WARPS*sizeof(int));
  for (int i = 0; i < N_WARPS; i++) printf("index: %d, warp_id: %d\n", i, h_warp_order[i]);
  return 0;
}
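
Since each warp is released only after its predecessor records its slot, the expected printout pairs every index i with warp_id i, from 0 up to N_WARPS-1; the -1 initialization on the host makes it obvious if any slot was never written.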

Allowing only one warp to execute at a time will be very inefficient, of course.

In general, the best parallelizable algorithms have little or no order dependence.

Robert Crovella
  • Nice answer. Are there any cases where such a hack is useful? – user2398029 Jul 27 '14 at 03:32
  • 1
    I'm not aware of any. There is a concept of ["warp specialization"](http://www.stanford.edu/~mebauer/pdfs/singe.pdf) (you may also want to google that), but I don't think it necessarily implies forcing an a-priori order on warp execution. It relates to inter-warp synchronization, I think. – Robert Crovella Jul 27 '14 at 14:31
  • Thanks a lot. As far as I was aware, there was no existing way to do so. However, my knowledge of debuggers like TotalView (and other GPU debuggers) is limited in terms of their ability to control GPU warp scheduling. While it would not have surprised me if someone had come up with a crazy debugger to actually perform this type of hack automatically, I was pretty sure it didn't exist. As for your comment on order dependence being bad, in general you are correct, but for the specific algorithm I am dealing with that is not the case (during prime time any order is fine, but it makes debugging challenging). – NothingMore Aug 01 '14 at 01:20