
We recently encountered some CUDA memory model related issues when doing cross-CTA communication, and we are seeking an authoritative answer from CUDA memory model experts. Specifically, we want to know whether the causality order remains transitive across different scopes. The specific case is as follows:

```cuda
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N); 

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }   

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {

        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {

            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }   
    }   
}
```

In the above case, the result array is not declared volatile, so its elements may be cached incoherently in L1.

So we want to know: according to the CUDA memory model, when the last block executes calculateTotalSum, will it safely read the other CTAs' partial sums?

We suspect this has something to do with the transitivity of causality order across different scopes. With the __threadfence() and the atomicInc() on count, a causality order is established at GPU scope between the other blocks' writes to the result array and the last block's write to the isLastBlockDone flag: write_result_array -> write_isLastBlockDone. With __syncthreads(), a causality order is established at CTA scope between the write to isLastBlockDone and the reads of the result array: write_isLastBlockDone -> read_result_array.

Does the causality order remain transitive across different scopes according to the CUDA memory model? For example, in our case, does the following causality order hold: write_result_array -> write_isLastBlockDone -> read_result_array?

Is this code correct according to the CUDA memory model? Does the causality order remain transitive across different scopes?
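For reference, below is how we would tentatively express the intended write_result_array -> write_isLastBlockDone -> read_result_array chain with explicit scopes and orderings using libcu++ (`<cuda/atomic>`). This is only a sketch under assumptions: a CUDA 11+ toolkit with libcu++, an sm_70+ GPU, and stubbed placeholder helpers (the real calculatePartialSum/calculateTotalSum are not shown above). It is not necessarily equivalent to the original code.

```cuda
#include <cuda/atomic>

__device__ unsigned int count = 0;

// Placeholder stubs (hypothetical); the real helpers are not shown above.
__device__ float calculatePartialSum(const float* array, unsigned int N)
{
    float s = 0.0f;
    for (unsigned int i = threadIdx.x; i < N; i += blockDim.x)
        s += array[i];
    return s;  // not a real block-wide reduction, just a stand-in
}

__device__ float calculateTotalSum(const float* result)
{
    float s = 0.0f;
    for (unsigned int i = 0; i < gridDim.x; ++i)
        s += result[i];
    return s;  // every thread of the last block computes the same total
}

__global__ void sum_scoped(const float* array, unsigned int N, float* result)
{
    __shared__ bool isLastBlockDone;

    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Plain store of this block's partial sum.
        result[blockIdx.x] = partialSum;

        // Device-scope acq_rel RMW on the counter: the release half orders
        // the result[] store before the increment; the acquire half (in the
        // last block) makes all earlier releases on 'count', and hence the
        // other blocks' result[] stores, visible.
        cuda::atomic_ref<unsigned int, cuda::thread_scope_device> c(count);
        unsigned int value = c.fetch_add(1u, cuda::memory_order_acq_rel);

        isLastBlockDone = (value == gridDim.x - 1);
    }

    // Block-scope synchronization: the ordering acquired by thread 0 above
    // is extended to every thread of the (last) block.
    __syncthreads();

    if (isLastBlockDone) {
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;  // reset for the next launch, as in the original
        }
    }
}
```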

njuallen
  • Pretty much this exact pattern can be found in the [`2_Concepts_and_Techniques/threadFenceReduction`](https://github.com/NVIDIA/cuda-samples/blob/master/Samples/2_Concepts_and_Techniques/threadFenceReduction/threadFenceReduction_kernel.cuh#L126-L204) CUDA sample. So if I have not overlooked a difference and Nvidia isn't keeping around outdated samples, it should work – paleonix Jan 11 '23 at 12:36
  • Yes, exactly. In the official CUDA sample, the result buffer is declared volatile. I wonder, if I remove the volatile modifier, will it work correctly? – njuallen Jan 11 '23 at 14:06
  • In what way does the linked sample declare any global buffer as `volatile` ? – Robert Crovella Jan 11 '23 at 14:57
  • The only occurrence of `volatile` is in `reduceBlock` (i.e. OP's `calculatePartialSum`) on the shared memory pointer. As OP hasn't provided that function, I don't think he means that one. So "the result buffer is declared volatile" is wrong. Although I'm now wondering myself if that `volatile` on the shared memory pointer is necessary... But that's certainly not OP's question. – paleonix Jan 11 '23 at 15:17
  • I am referring to this CUDA sample: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions The code is near the section "7.6. Synchronization Functions". In fact, my case is modified from this specific sample; the only difference is the volatile modifier on the result global buffer. I want to know whether the volatile modifier is optional or required. – njuallen Jan 12 '23 at 02:34
  • I personally would not even try to study the question without knowing what calculateTotalSum looks like. I also note this statement immediately prior to that code snippet in the programming guide: "In the code sample below, the visibility of memory operations on the result variable is ensured by declaring it as volatile... " – Robert Crovella Jan 12 '23 at 04:23
  • Yes, I see that statement. If I use volatile, this code is definitely correct. I wonder, if I do not use volatile for the global buffer, is this code still correct? I have seen some kernels with the same usage pattern (volatile also missing) in open-source PyTorch kernels; maybe this is a common abuse of the CUDA memory model? – njuallen Jan 12 '23 at 06:12
  • @RobertCrovella I don't see why `calculatePartialSum` is relevant for OP's question. It could just be `return 1.0f` without any impact on how that value is then communicated between blocks, or am I overlooking something? – paleonix Jan 12 '23 at 13:29
  • I said `calculateTotalSum` not `calculatePartialSum`. `calculateTotalSum` apparently does its work entirely with `result` in global memory. So what it is doing matters. – Robert Crovella Jan 12 '23 at 15:12
  • @RobertCrovella Sorry, I misread... – paleonix Jan 12 '23 at 15:57
  • @RobertCrovella My interest is piqued. If `calculateTotalSum` is needed to understand the Programming Guide sample, why does it not include it? How does `calculateTotalSum` need to look for the `volatile` to be necessary? Why is it not necessary in the `threadFenceReduction` sample? Should I post a new question? ^^ – paleonix Jan 12 '23 at 16:11
  • I don't know for certain that it is needed. I said **I** wouldn't spend time on this without knowing what is in there. I never said the documentation is perfect. The existence of this question (and probably many others) seems to make that self-evident. Anyone who wants to see an improvement in the documentation can [file a bug](https://forums.developer.nvidia.com/t/how-to-report-a-bug/67911). Regarding `volatile`, given the uncertainties that may exist, including the `volatile` decorator certainly seems to be the conservative approach – Robert Crovella Jan 12 '23 at 17:15
  • My sense is that the `volatile` decorator is probably not needed, *given today's definition of CUDA and CUDA GPUs*. This is predicated on a few things: 1. I don't remember and don't wish to test the exact behavior of the L1 in Fermi and Kepler GPUs. 2. GPU designers in the future don't decide to create a GPU that has a write-back L1 or an L1 design that would allow for a cache line to be updated on a write. 3. The aforementioned `calculateTotalSum` isn't doing anything unusual like using warp-synchronous reduction. There are a few other possible caveats, perhaps, that are a bit more obscure. – Robert Crovella Jan 12 '23 at 19:52
  • If you want an "authoritative answer", my suggestion is to file a bug requesting a clarification of the docs – Robert Crovella Jan 12 '23 at 19:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251093/discussion-between-paleonix-and-robert-crovella). – paleonix Jan 12 '23 at 20:46
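Since the comment thread above notes that the exact contents of `calculateTotalSum` matter, here is a purely hypothetical sketch of what such a helper might look like when the "conservative" `volatile` approach is taken. It is not the helper from the question, the Programming Guide, or the `threadFenceReduction` sample, and it assumes blockDim.x is a power of two of at most 256.

```cuda
// Hypothetical calculateTotalSum (illustrative only). The volatile qualifier
// on the pointer forces each read of result[] to go to memory rather than a
// potentially stale cached value. Assumes blockDim.x is a power of two <= 256.
__device__ float calculateTotalSum(volatile float* result)
{
    __shared__ float sdata[256];

    // Each thread accumulates a strided subset of the partial sums.
    float sum = 0.0f;
    for (unsigned int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
        sum += result[i];          // volatile read of each partial sum

    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Straightforward tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    return sdata[0];               // total of all partial sums
}
```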
