bit shift operation in parallel prefix sum

Question

This code, from OpenGL SuperBible 10, computes a prefix sum in parallel.

The shader shown has a local workgroup size of 1024, which means it will process arrays of 2048 elements, as each invocation computes two elements of the output array. The shared variable shared_data is used to store the data that is in flight. When execution starts, the shader loads two adjacent elements from the input arrays into the array. Next, it executes the barrier() function. This step ensures that all of the shader invocations have loaded their data into the shared array before the inner loop begins.

#version 450 core
layout (local_size_x = 1024) in;
layout (binding = 0) coherent buffer block1
{
    float input_data[gl_WorkGroupSize.x];
};
layout (binding = 1) coherent buffer block2
{
    float output_data[gl_WorkGroupSize.x];
};
shared float shared_data[gl_WorkGroupSize.x * 2];
void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;
    // The number of steps is the log base 2 of the
    // work group size, which should be a power of 2
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;
    // Each invocation is responsible for the content of
    // two elements of the output array
    shared_data[id * 2] = input_data[id * 2];
    shared_data[id * 2 + 1] = input_data[id * 2 + 1];
    // Synchronize to make sure that everyone has initialized
    // their elements of shared_data[] with data loaded from
    // the input arrays
    barrier();
    memoryBarrierShared();
    // For each step...
    for (step = 0; step < steps; step++)
    {
        // Calculate the read and write index in the
        // shared array
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);
        // Accumulate the read data into our element
        shared_data[wr_id] += shared_data[rd_id];
        // Synchronize again to make sure that everyone
        // has caught up with us
        barrier();
        memoryBarrierShared();
    }
    // Finally write our data back to the output image
    output_data[id * 2] = shared_data[id * 2];
    output_data[id * 2 + 1] = shared_data[id * 2 + 1];
}

How can the bit-shift operations that compute rd_id and wr_id be understood intuitively? Why do they work?

score 0 · Answer 1 · answered Aug 09 '22 at 23:56

When we say something is "intuitive" we usually mean that our understanding is deep enough that we are not aware of our own thought processes, and "know the answer" without consciously thinking about it. Here the author is using the binary representation of integers within a CPU/GPU to make the code shorter and (probably) slightly faster. The code will only be "intuitive" for someone who is very familiar with such encodings and binary operations on integers. I'm not, so I had to think about what is going on.

I would recommend working through this code, since these kinds of operations do occur in high-performance graphics and other programming. If you find it interesting, it will eventually become intuitive. If not, that's OK as long as you can figure things out when necessary.

One approach is to just copy this code into a C/C++ program and print out the mask, rd_id, wr_id, etc. You wouldn't actually need the data arrays, or the calls to barrier() and memoryBarrierShared(). Make up values for invocation ID and workgroup size based on what the SuperBible example does. That might be enough for "Aha! I see."
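As a rough sketch of that experiment, here it is in Python rather than C/C++, with the data arrays and barriers left out. The function name index_table and the tiny workgroup size of 4 (so 8 elements) are my own choices for illustration, not something from the book.

```python
import math

def index_table(workgroup_size):
    """Return (step, id, mask, rd_id, wr_id) tuples using the shader's index math."""
    # Same step count as the shader: log2 of the workgroup size, plus one.
    steps = int(math.log2(workgroup_size)) + 1
    rows = []
    for step in range(steps):
        mask = (1 << step) - 1
        for inv in range(workgroup_size):  # inv plays the role of gl_LocalInvocationID.x
            rd_id = ((inv >> step) << (step + 1)) + mask
            wr_id = rd_id + 1 + (inv & mask)
            rows.append((step, inv, mask, rd_id, wr_id))
    return rows

if __name__ == "__main__":
    for step, inv, mask, rd_id, wr_id in index_table(4):
        print(f"step={step} id={inv} mask={mask:03b} rd_id={rd_id} wr_id={wr_id}")
```

In the last step every invocation reads the same element (index 3) and writes to a different one (4 to 7), which is a good hint about what the pattern is doing.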

If you aren't familiar with the << and >> shifts, I suggest writing some tiny programs and printing out the numbers that result. Python might actually be slightly easier, since

print("{:016b}".format(mask))

will show you the actual bits, whereas C's printf has traditionally offered only hex and octal, not binary.

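To get you started, log2 of a power of two returns the exponent, which is the position of its single set bit. log2(256) will be 8, log2(4096) 12, etc. (Don't take my word for it, write some code.) With a workgroup size of 1024, log2 gives 10, so steps is 11: one step for each doubling on the way from 1 to 2048 elements.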

x << n multiplies x by 2 to the power n, so x << 1 is x * 2, x << 2 is x * 4, and so on. x >> n divides by 2 to the power n instead, discarding the remainder, so x >> 1 is x / 2, x >> 2 is x / 4. (Very important: only for non-negative integers! Again, write some code to find out what happens.)
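Here is one such tiny program, as a sketch. The helper name shift_table and the sample value 13 are arbitrary picks of mine. The last line shows the negative-number caveat as Python behaves; other languages may differ, so check yours.

```python
def shift_table(x, max_shift):
    """Return (n, x << n, x >> n) for n = 0 .. max_shift - 1."""
    return [(n, x << n, x >> n) for n in range(max_shift)]

if __name__ == "__main__":
    for n, left, right in shift_table(13, 4):
        print(f"n={n}  13 << n = {left:3d} ({left:08b})   13 >> n = {right:2d} ({right:08b})")

    # For a negative value, Python's >> rounds down (-7), while a division
    # that truncates toward zero gives -6, so the two are not interchangeable.
    print(-13 >> 1, int(-13 / 2))
```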

The mask calculation is interesting. Try

mask = (1 << step);

first and see what values come out. This is a common pattern for selecting an individual bit. Subtracting 1 instead sets every bit to the right of that position and clears the bit itself, so step = 3 gives binary 0111.

Anding (the & operator) with a mask that has zeroes on the left and ones on the right is a faster way to compute an integer % a power of 2: id & mask gives the same result as id % (1 << step).
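A quick way to convince yourself of the mask and modulo claims is to check them by brute force. This is only a sketch; low_bits_mask and masks_match_modulo are names I made up for it.

```python
def low_bits_mask(step):
    """Mask with the lowest `step` bits set, as computed in the shader."""
    return (1 << step) - 1

def masks_match_modulo(max_step, max_value):
    """Check value & mask == value % (1 << step) for all small non-negative values."""
    return all(
        (value & low_bits_mask(step)) == value % (1 << step)
        for step in range(max_step)
        for value in range(max_value)
    )

if __name__ == "__main__":
    for step in range(5):
        print(f"step={step}  1<<step={1 << step:016b}  mask={low_bits_mask(step):016b}")
    print(masks_match_modulo(11, 2048))
```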

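Finally, rd_id and wr_id are array indexes built from the invocation ID. At each step the array is treated as blocks of 2^(step+1) elements. (id >> step) << (step + 1) is the start of the block this invocation works on, and adding mask gives the last element of that block's lower half, which by then holds the sum of the whole lower half. wr_id = rd_id + 1 + (id & mask) is one element of the upper half, picked by the low bits of id. So each invocation adds the lower half's total to one upper-half element, and the blocks double in size every step, following the pattern explained in the SuperBible text.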
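To see that the indexing really produces a prefix sum, you can replay the shader's loop on the CPU. The sketch below is my own Python stand-in, not the book's code: simulate_prefix_sum is a made-up name, the input length is assumed to be a power of two (at least 2), and a plain loop over invocation IDs replaces the parallel execution. That substitution is safe here because within one step the elements read (ends of lower halves) and the elements written (upper halves) never overlap.

```python
import math
from itertools import accumulate

def simulate_prefix_sum(data):
    """Replay the shader's loop sequentially; len(data) is assumed to be a power of two >= 2."""
    workgroup_size = len(data) // 2      # each invocation handles two elements
    shared = list(data)                  # stands in for shared_data[]
    steps = int(math.log2(workgroup_size)) + 1
    for step in range(steps):
        mask = (1 << step) - 1
        for inv in range(workgroup_size):
            rd_id = ((inv >> step) << (step + 1)) + mask   # end of the lower half
            wr_id = rd_id + 1 + (inv & mask)               # one slot in the upper half
            shared[wr_id] += shared[rd_id]
        # the shader calls barrier() here, before the next step starts
    return shared

if __name__ == "__main__":
    values = [1, 2, 3, 4, 5, 6, 7, 8]
    print(simulate_prefix_sum(values))
    print(list(accumulate(values)))      # reference running total
```

Both lines print the same running totals, so the result is an inclusive prefix sum: each output element includes its own input value.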

Thanks a lot! It is a new perspective for me that "a faster way for an integer % a power of 2". — mq s, Aug 30 '22 at 13:10
