
I have written a small kernel to sum 2^k elements using parallel reduction. Nothing new here... My vector is stored in global memory, I assign each part of the vector to a different block, and I reduce each block to a single position. The rest I do on the CPU.

__global__ void sum(real *v, long int s){

    long int ix     =  threadIdx.x;
    long int shift = blockIdx.x*blockDim.x;

    long int h = blockDim.x/2;
    while (h >= 1){
        if (ix < h){
            v[ix +  shift] = v[2*ix + shift] + v[2*ix + 1 + shift];
        }
        __syncthreads(); 
        h = h / 2;
    }
}
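
For context, the host side is roughly the following (a minimal sketch rather than my exact code; real is assumed to be float here, the element count is a power of two, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

typedef float real;                        // assumption: "real" is float

__global__ void sum(real *v, long int s);  // the kernel shown above

int main()
{
    const long int n = 1 << 20;            // 2^k elements
    const int threads = 1024;              // one slice of 1024 elements per block
    const long int blocks = n / threads;

    real *h_v = new real[n];
    for (long int i = 0; i < n; i++) h_v[i] = 1.0f;

    real *d_v;
    cudaMalloc(&d_v, n * sizeof(real));
    cudaMemcpy(d_v, h_v, n * sizeof(real), cudaMemcpyHostToDevice);

    // each block reduces its slice down to one value at v[blockIdx.x * blockDim.x]
    sum<<<blocks, threads>>>(d_v, n);
    cudaMemcpy(h_v, d_v, n * sizeof(real), cudaMemcpyDeviceToHost);

    // "the rest I do on the CPU": add up the per-block partial sums
    real total = 0;
    for (long int b = 0; b < blocks; b++) total += h_v[b * threads];
    printf("sum = %f (expected %ld)\n", total, n);

    cudaFree(d_v);
    delete[] h_v;
    return 0;
}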

The code works. However, after careful inspection, I realized that maybe it should not work, so I am confused... It could be that thread 1, which sums elements 2 and 3, writes its sum to position 1 before thread 0 is able to read elements 0 and 1, thus making the result invalid.

I would have assumed that, to be safe, the code would have to be

__global__ void sumsafe(real *v, long int s){
    long int ix     =  threadIdx.x;
    long int shift = blockIdx.x*blockDim.x;
    real x = 0;
    long int h = blockDim.x/2;
    while (h >= 1){
        if (ix < h){
            x = v[2*ix + shift] + v[2*ix + 1 + shift];
        }
        __syncthreads(); 
        if (ix < h){
            v[ix +  shift] = x;
        }
        __syncthreads();
        h = h / 2;
    }
}

so that I guarantee that all the threads read their values before any of them start changing them. But as I said, both codes work, and their running times are pretty much the same.

Why is this?

I know that the GPU does not guarantee that what one thread writes to global memory becomes visible to other threads. But it does not guarantee that this never happens either.

Any ideas!? I am working on a GTX 1080.

cudarabit
  • Why do you call `__syncthreads()` if you don't use shared memory? Also, you should never assume a specific order of thread execution. – pSoLT Jan 07 '17 at 18:58
  • Guess you are just being lucky. – tera Jan 07 '17 at 19:02
  • I am calling `__syncthreads()` to guarantee that what one thread writes to global memory is visible to other threads. This is in general necessary. – cudarabit Jan 07 '17 at 19:49
  • Note this only applies to the threads within one block. If you run the kernel with more than one block, `__syncthreads()` doesn't provide that guarantee with respect to threads in other blocks. – tera Jan 07 '17 at 20:01
  • The particular organization of the indexing in this code means that block level sums are independent, so the `__syncthreads()` need not have other than block-level sync in order to be effective *for this code*. – Robert Crovella Jan 07 '17 at 20:13
  • Ah right - I didn't look close enough to notice that `shift` was consistently applied everywhere and that `h` was initialized to *half* the blocksize. Maybe I should check SO less often but spend more time when I do. – tera Jan 07 '17 at 20:20

1 Answer


You are indeed being "lucky" because CUDA provides no guarantee of execution order of warps. The following description (which is conjecture) should not be construed as a statement that what you've shown is a good idea. Nobody should do reductions like this.

But for a small test case (no other code than this, and operating on a single block of data), I would expect this to work.

Reads from global memory are generally high-latency. When execution encounters this line of code:

        v[ix +  shift] = v[2*ix + shift] + v[2*ix + 1 + shift];

that will translate into SASS instructions something like this:

LD  R0, v[2*ix + shift]        (let's call this LD0)
LD  R1, v[2*ix + 1 + shift]    (let's call this LD1)
ADD R3, R0, R1
ST  v[ix + shift], R3

Now, the first two LD operations do not cause a stall. However, the ADD operation will cause a stall (it cannot be issued) if R0 and R1 are not yet valid.

The result of the stall will be that the warp scheduling engine in the SM will look for other available work. That other available work will likely constitute the above code for other warps.

Since an ADD instruction cannot be issued until its reads are complete, and since the reads across the warps are all issued effectively back-to-back as the warp scheduler responds to each stall, the reads will tend to have all completed by the time the ADD instructions finish issuing. ADD also has a pipeline latency, so the ADD operations will likely be issued in sequence as well (although a short pipeline latency here would increase the probability of hazard), and a given ST operation cannot be issued until its corresponding ADD operation is complete. The net effect could be:

LD0 W0
LD1 W0
LD0 W1
LD1 W1
...   (all LD0 and LD1 get issued across all warps W0..WN)
<read latency stall --  eventually the first 2 LD0 and LD1 complete>
ADD W0
<read pipeline latency - 1 cycle>
ADD W1
<read pipeline latency - 1 cycle>
ADD W2
...
<add pipeline latency>
ST W0
<add pipeline latency>
ST W1
...

The result of the latency is that, with high likelihood, all reads get issued to global memory before any ADD operations begin. Due to the pipeline effect, it's possible (likely?) that all read operations also complete before any ST operations begin, so that for this limited test case no actual hazard error occurs.

I would expect that even if the data is in the L2 cache, the read latency from L2 may still be enough to allow the above to work. I suspect that if the data were in the L1 cache, the read latency from L1 (assuming a maximum complement of warps) may not be enough for the above description to hold, but I haven't gone through the arithmetic carefully. Since the ADD pipeline latency is fixed, but the hazard from LD to ST operations is determined by the number of ADD operations compared to that pipeline latency, the actual hazard probability increases as you load up more warps in the threadblock.

Note that all of the above description attempts to unpack the behavior of a single iteration of your while loop. The memory-barrier effect of __syncthreads() should then guarantee that the reads of iteration i+1 are not corrupted by failing to witness the writes of iteration i.
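
As stated at the outset, nobody should do reductions this way. For reference, a minimal sketch of the conventional pattern (in the spirit of the CUDA sample reduction codes; the name sum_conventional and the per-block output array block_sums are illustrative, and blockDim.x is assumed to be a power of two) stages the tree in shared memory and keeps each level's reads and writes disjoint:

__global__ void sum_conventional(real *v, real *block_sums)
{
    extern __shared__ real sdata[];         // blockDim.x * sizeof(real), passed at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = v[i];                      // one element per thread into shared memory
    __syncthreads();

    for (unsigned int h = blockDim.x / 2; h > 0; h >>= 1) {
        if (tid < h)
            sdata[tid] += sdata[tid + h];   // read set and write set are disjoint within a level
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];  // one partial sum per block
}

Because a thread with tid < h reads sdata[tid] and sdata[tid + h] but writes only sdata[tid], no thread writes a location that another thread reads within the same level, so the only ordering required is the __syncthreads() between levels.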

Robert Crovella
  • PS: the code works for summing very large vectors of size 1024*1024*256 – cudarabit Jan 07 '17 at 19:47
  • The only thing that matters here is what is going on at the block level. Your code as shown does not sum a large vector into a single number. It sums a large vector into a set of independent block sums. The number of blocks has no effect on the above analysis, except in that more blocks may contribute to more warps available to each SM. But since the threadblocks are independent, this particular increase (of independent warps) actually reduces the probability of hazard. – Robert Crovella Jan 07 '17 at 19:50
  • very nice... makes sense... I wonder if there is some way to force the code to produce an error? Maybe running it multiple times against random inputs and hoping that something bad will happen? Also... I assume that the second code is formally correct, right!? – cudarabit Jan 07 '17 at 19:54
  • It's possible, [with some effort](http://stackoverflow.com/questions/24977294/is-cuda-warp-scheduling-deterministic/24977532#24977532), to force warps to execute in a desired order. Feel free to explore that. As I stated in my answer, this is conjecture. I would never use either of these methods for a reduction, so spending my time convincing myself of the formal correctness of either case is not something I would want to do. But since your 2nd case breaks the read and writes into separate regions controlled by barriers, it would seem to me to prevent the supposed hazard in the first code. – Robert Crovella Jan 07 '17 at 20:01
  • Also, the [racecheck](http://docs.nvidia.com/cuda/cuda-memcheck/index.html#racecheck-tool) tool of [CUDA-memcheck](https://developer.nvidia.com/cuda-memcheck) would be able to detect the race, if you moved the array to shared memory. – tera Jan 07 '17 at 20:05
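
As a sketch of tera's suggestion (hypothetical code, not from the thread): if the block's slice is staged in shared memory and the same racy update is run there, the racecheck tool can report the hazard, which it cannot do for global-memory traffic:

__global__ void sum_shared(real *v, long int s)
{
    extern __shared__ real sv[];            // blockDim.x elements, sized at launch
    long int ix    = threadIdx.x;
    long int shift = blockIdx.x * blockDim.x;

    sv[ix] = v[ix + shift];                 // stage this block's slice in shared memory
    __syncthreads();

    long int h = blockDim.x / 2;
    while (h >= 1) {
        if (ix < h)
            sv[ix] = sv[2*ix] + sv[2*ix + 1];   // same read/write hazard as the global-memory version
        __syncthreads();
        h = h / 2;
    }

    if (ix == 0)
        v[shift] = sv[0];                   // write the block's sum back
}

Launched with something like sum_shared<<<blocks, threads, threads * sizeof(real)>>>(d_v, n) and run under cuda-memcheck --tool racecheck, this version should make the intra-block read/write hazard visible to the tool.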