
I've found a piece of OpenCL kernel sample code on Nvidia's developer site. The purpose of the function maxOneBlock is to find the largest value in the array maxValue and store it in maxValue[0].

I fully understand the looping part, but I'm confused about the unrolled part: why does the unrolled part not need to synchronize threads after each step?

E.g., when one thread has finished comparing localId and localId+32, how is it ensured that the other threads have already stored their results before it reads localId+16?

The kernel code:

void maxOneBlock(__local float maxValue[],
                 __local int   maxInd[])
{
    uint localId   = get_local_id(0);
    uint localSize = get_local_size(0);
    int idx;
    float m1, m2, m3;

    for (uint s = localSize/2; s > 32; s >>= 1)
    {
        if (localId < s) 
        {
            m1 = maxValue[localId];
            m2 = maxValue[localId+s];
            m3 = (m1 >= m2) ? m1 : m2;
            idx = (m1 >= m2) ? localId : localId + s;
            maxValue[localId] = m3;
            maxInd[localId] = maxInd[idx];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // unroll the final warp to reduce loop and sync overheads
    if (localId < 32)
    {
        m1 = maxValue[localId];
        m2 = maxValue[localId+32];
        m3 = (m1 > m2) ? m1 : m2;
        idx = (m1 > m2) ? localId : localId + 32;
        maxValue[localId] = m3;
        maxInd[localId] = maxInd[idx];


        m1 = maxValue[localId];
        m2 = maxValue[localId+16];
        m3 = (m1 > m2) ? m1 : m2;
        idx = (m1 > m2) ? localId : localId + 16;
        maxValue[localId] = m3;
        maxInd[localId] = maxInd[idx];

        m1 = maxValue[localId];
        m2 = maxValue[localId+8];
        m3 = (m1 > m2) ? m1 : m2;
        idx = (m1 > m2) ? localId : localId + 8;
        maxValue[localId] = m3;
        maxInd[localId] = maxInd[idx];

        m1 = maxValue[localId];
        m2 = maxValue[localId+4];
        m3 = (m1 > m2) ? m1 : m2;
        idx = (m1 > m2) ? localId : localId + 4;
        maxValue[localId] = m3;
        maxInd[localId] = maxInd[idx];

        m1 = maxValue[localId];
        m2 = maxValue[localId+2];
        m3 = (m1 > m2) ? m1 : m2;
        idx = (m1 > m2) ? localId : localId + 2;
        maxValue[localId] = m3;
        maxInd[localId] = maxInd[idx];

        m1 = maxValue[localId];
        m2 = maxValue[localId+1];
        m3 = (m1 > m2) ? m1 : m2;
        idx = (m1 > m2) ? localId : localId + 1;
        maxValue[localId] = m3;
        maxInd[localId] = maxInd[idx];
    }
}
melode11
1 Answer


Why does the unrolled part not need to synchronize threads after each step?

The sample is incorrect; a barrier is indeed required after each step.

It looks like the sample is written in a warp-synchronous style, which exploits the lock-step execution of warps on NVIDIA hardware. However, omitting the synchronization will break the code if the underlying execution mechanism changes, or in the presence of compiler optimizations.
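One way to repair the sample (a sketch, not the official fix) is to keep the tree reduction but place a barrier after every step. Since `barrier` must be reached by all work-items in the group, the steps can no longer be wrapped in a single `if (localId < 32)` block; the guard moves inside a loop instead:

    // Sketch: barrier-correct replacement for the unrolled section.
    // Every work-item must reach each barrier, so the thread guard
    // goes inside the loop body rather than around the whole block.
    for (uint s = 32; s > 0; s >>= 1)
    {
        if (localId < s)
        {
            m1 = maxValue[localId];
            m2 = maxValue[localId + s];
            m3 = (m1 >= m2) ? m1 : m2;
            idx = (m1 >= m2) ? localId : localId + s;
            maxValue[localId] = m3;
            maxInd[localId] = maxInd[idx];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

This gives up the loop- and sync-overhead savings the original unrolling was after, but it is correct on any conforming OpenCL implementation rather than only under warp-synchronous assumptions.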

user703016
  • Even if it is written in warp-synchronous style, the `unroll` part would still need to restrict which threads run after each step, i.e. the 1st step to 32 threads, the 2nd to 16 threads, etc. But it doesn't: all 32 threads run the whole `unroll` code. – melode11 Jun 01 '15 at 06:08
  • Yes, but their results are not used: they're just doing extra work for free. Instead of disabling half the threads at every iteration, the author chose to let them run. It makes the code simpler and impacts neither the performance nor the final result. This is a relatively common technique. But it doesn't make barriers optional, and in that respect the sample is wrong. – user703016 Jun 01 '15 at 06:23