2

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each block writes either all 1's or all 0's to an array of length 32 in global memory, where thread 0 of a block writes to position 0 in the array.

The actual values are written with atomicExch, exchanging the current value in the array with the value the block is attempting to write. Because of SIMD, the atomic operation, and the fact that a warp executes in lockstep, I would expect the array, at any point in time, to contain only 1's or only 0's, never a mix of the two.
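For reference, a minimal sketch of the kind of kernel I mean (the names and launch configuration here are illustrative, not my exact code):

#include <cuda_runtime.h>

// each block writes its "value" (0 or 1) into all 32 positions of the array;
// thread i of a block targets position i
__global__ void writeBlockValue(int *arr){
  int val = blockIdx.x & 1;            // half the blocks write 0, the other half write 1
  atomicExch(&arr[threadIdx.x], val);  // swap the current value for this block's value
}
// launched as writeBlockValue<<<8, 32>>>(d_arr); where d_arr is a 32-int array in global memory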

However, when running code like this, there are several cases where, at some point in time, the array contains a mix of 0's and 1's. This appears to indicate that atomic operations are not executed per warp, but are instead scheduled using some other scheme.

I have not found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. I need to write many small vectors, each consisting of several 32-bit integers, atomically to global memory, so an operation that is guaranteed to write a single vector atomically is very important to me.

For those wondering, the code I wrote was executed on a GTX 970, compiled for compute capability 5.2, using CUDA 8.0.

einpoklum
TheDutchDevil
  • The atomic instructions, like all instructions, are scheduled per warp. However there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage through the pipeline. This gives rise to the possibility for your observations. A simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. – Robert Crovella Apr 12 '17 at 22:58
  • Furthermore, modern GPUs are capable of issuing multiple instructions per clock cycle, within the same SM. Therefore without supposing any more guaranteed details about GPU SM arch (such as "there are a limited number of resources which can execute an atomic", or, "only one atomic can be scheduled per clock cycle, per SM") then it would be evident that two warps could both schedule an atomic *in the same cycle*. In that case we could also suppose lockstep execution is not possible even for separate instructions, e.g. if a thread from each warp targeted the same location. – Robert Crovella Apr 12 '17 at 23:07
  • Just a note to the OP and other readers - remember the GTX 970 has a hardware bug regarding memory access for part of its last GB of memory; even if that doesn't affect the OP, one should generally avoid using that card for experimentation involving memory access. – einpoklum Apr 14 '17 at 07:51
  • @RobertCrovella, if atomic operations are not guaranteed to be scheduled per warp are there any other ways to guarantee that a vector of several 32 bit integers can be written to global memory atomically? – TheDutchDevil Apr 14 '17 at 09:47
  • @einpoklum, I also executed the same code on a Maxwell Titan X with pretty much the same output, so the bug in the memory of the GTX 970 definitely doesn't have any effect on this small test program. Which is to be expected as I only allocate 32 ints in global device memory. – TheDutchDevil Apr 14 '17 at 09:51
  • @TheDutchDevil well I have to be careful about what I say, but I would suggest that "yes there are other ways" but perhaps what I am thinking about is not what you are thinking about. If you are asking "is there a way that I can issue a single instruction across a warp and have that result in a coherent write of up to thirty-two 32-bit integers" I would say no, I don't know of such a thing. OTOH if you want to make sure that a vector is coherent, you could use reduction based approaches (preferred) or possibly a "critical section" to arrange for that. Critical sections can be troublesome. – Robert Crovella Apr 14 '17 at 11:40
  • @RobertCrovella, I'm modifying an application that discovers many small vectors in parallel and writes these vectors to a hashtable in global memory, since constant access time after a vector has been discovered is important. Given that the vectors need to be written into a hashtable and the discovery is ongoing, I don't see a reduction working. Equally, I'm afraid a critical section would have a large performance impact. – TheDutchDevil Apr 14 '17 at 13:36
  • As you discover these vectors, are you intending to write each to its own location in global memory, or do you expect vectors to be written on top of each other? If you are writing on top of each other, how do you decide whether to write or not? I guess your needs are still unclear. Questions like this often end up being an X-Y problem, and if you describe your needs (at a higher level, rather than "I need a coherent vector write"), it may be that someone can suggest something useful. – Robert Crovella Apr 14 '17 at 14:29
  • If we were to follow your description in the question, for example, we could trivially reduce that problem to writing a single value, which can easily be done atomically. Use the single value of "0" or "1" as a proxy for "the array is all zero" or "the array is all 1". I imagine that is not what you want. In addition to my question above, it would be useful to know whether the array length is actually 32. If it could be a shorter length, or somehow packed into a shorter length, such as 16 or 32 bytes, then there may be other approaches that are relatively straightforward. – Robert Crovella Apr 14 '17 at 14:55
  • @RobertCrovella, The vectors that are discovered are state vectors found during on-the-fly state space exploration of product automata. Usually these vectors are only 3 or 4 32-bit integers long, so while they are not the size of a warp, they are usually less than 32 bytes. Vectors will never be written on top of each other, as the hashtable is used to keep track of states that have already been discovered. So it is only a question of finding an empty slot (a contiguous strip of empty cells in the global memory array) and then atomically writing the vector into the slot. – TheDutchDevil Apr 14 '17 at 15:20
  • OK you just need coherent read/write, you don't really need atomics across the vector. You can manage coherent read/write with atomic access control, which is easily accomplished with ordinary atomics. – Robert Crovella Apr 14 '17 at 15:52

1 Answer

3

The atomic instructions, like all instructions, are scheduled per warp. However there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage through the pipeline. This gives rise to the possibility for your observations.

I believe a simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where multiple instructions are issued per clock within an SM, and even across SMs, as additional examples.

If the vector length were short enough (16 bytes or less) then it should be possible to accomplish this ("atomic update") simply by having a thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) are attempting to update a naturally aligned location, the write should not be corrupted by other writes.
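As a hedged sketch of that idea (the table name and size here are placeholders, and the vector is assumed to fit in a naturally aligned 16-byte slot):

#include <cuda_runtime.h>

const int num_vecs = 256;             // placeholder table size
__device__ int4 vec_table[num_vecs];  // int4 storage is naturally 16-byte aligned

__global__ void store_vectors(){
  // each thread writes a 4-int vector with a single 16-byte store; because the
  // int4 location is naturally aligned, the store is issued as one 128-bit
  // transaction and cannot be torn by other threads' vector stores
  int4 v = make_int4(threadIdx.x, threadIdx.x, threadIdx.x, threadIdx.x);
  vec_table[threadIdx.x % num_vecs] = v;
}

The same reasoning applies on the read side: a matching int4 load of an aligned slot is likewise a single 16-byte transaction.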

However, after discussion in the comments, it seems that OP's goal is to be able to have a warp or threadblock update a vector of some length, without interference from other warps or threadblocks. It seems to me that what is really desired is access control (so that only one warp or threadblock updates a particular vector at a time), and OP had some code that wasn't working as desired.

This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.

What follows is an example producer-consumer code, where there are multiple threadblocks that are updating a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:

  1. vector is empty
  2. vector is being filled
  3. vector is filled, ready for "consumption"

With this three-state scheme, we can allow ordinary access to the vector by the consumer and by multiple producer workers, with ordinary atomic access to a single control variable per slot. Here is an example code:

#include <assert.h>
#include <iostream>
#include <stdio.h>

const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64;  // needs to be greater than the maximum number of SMs for any GPU that it will be run on
__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0};      // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;

static __device__ __inline__ int __mysmid(){
  int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;}


// this code won't work on a GPU with a single SM!
__global__ void kernel(){

  __shared__ volatile int done, update, next_slot;
  int my_block_id = atomicAdd(&block_id, 1);
  int my_sm = __mysmid();
  if (my_block_id == 0){
    if (!threadIdx.x){
      restricted_sm = my_sm;
      __threadfence();
      // I am "block 0" and process the vectors, checking for coherency
      // "consumer"
      next_slot = 0;
      volatile int *vslot_control = slot_control;
      volatile int *vslots = slots;
      int scount = 0;
      while(scount < max_act){
        if (vslot_control[next_slot] == slot_full){
          scount++;
          int slot_val = vslots[next_slot*slot_length];
          for (int i = 1; i < slot_length; i++) if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */}
          observations[slot_val]++;
          vslot_control[next_slot] = slot_empty;
          correct++;
          __threadfence();
          }
        next_slot++;
        if (next_slot >= num_slots) next_slot = 0;
        }
      }}
  else {
    // "producer"
    while (restricted_sm < 0);  // wait for signaling
    if (my_sm == restricted_sm) return;
    next_slot = 0;
    done = 0;
    __syncthreads();
    while (!done) {
      if (!threadIdx.x){
        while (atomicCAS(slot_control+next_slot, slot_empty,  slot_filling) > slot_empty) {
          next_slot++;
          if (next_slot >= num_slots) next_slot = 0;}
        // we grabbed an empty slot, fill it with my_sm
        if (atomicAdd(&num_act, 1) < max_act)   update = 1;
        else {done = 1; update = 0;}
        }
      __syncthreads();

      if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
      __threadfence(); //enforce ordering
      if ((update) && (!threadIdx.x)){
        slot_control[next_slot] = slot_full; // mark slot full
        atomicAdd(actives+my_sm, 1);}
      __syncthreads();
    }
  }
}

int main(){

  kernel<<<256, slot_length>>>();
  cudaDeviceSynchronize();
  cudaError_t res= cudaGetLastError();
  if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
  int *h_obs = new int[max_sm];
  int *h_act = new int[max_sm];
  int h_correct;
  cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
  cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
  cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
  int h_total_act = 0;
  int h_total_obs = 0;
  for (int i = 0; i < max_sm; i++){
    std::cout << h_act[i] << "," << h_obs[i] << " ";
    h_total_act += h_act[i];
    h_total_obs += h_obs[i];}
  std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}

I don't claim this code to be defect-free for any use case. It is offered to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on Linux, on a couple of different systems I tested it on. It should not be run on a GPU that has only a single SM, as one SM is reserved for the consumer and the remaining SMs are used by the producers.
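For reference, I compiled and ran it with something along the lines of nvcc -arch=sm_52 t.cu -o t (the file name is arbitrary; substitute the architecture of your own GPU).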

Robert Crovella
  • Thanks for the answer! Especially the fact that atomics are not guaranteed to execute in lockstep across a warp helps us out a ton! However, we are going to go in a different direction to ensure that vectors can be written atomically, without using an access control array that scales linearly with the hashtable. – TheDutchDevil Apr 28 '17 at 20:29