Parallel Prefix Sum CUDAfy

Question

I need an algorithm for computing the parallel prefix sum of an array without using shared memory. If shared memory cannot be avoided, what is the best way to deal with bank conflicts?

This is covered by [GPU Gems 3](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html) — harold, Jan 10 '16 at 13:22

score 3 · Accepted Answer · edited Nov 25 '19 at 15:02

This link contains a detailed analysis of the sequential and the parallel algorithms for parallel prefix sum:

Parallel Prefix Sum (Scan) with CUDA

It also includes a fragment of C code implementing the parallel prefix algorithm and a detailed explanation of how to avoid shared memory bank conflicts.
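On the conflict part of the question, the usual remedy is to pad shared-memory indices so that strided accesses land in different banks. The following is a minimal sketch of that index arithmetic in plain C#, assuming 32 shared-memory banks (check your device). `BankPadding`, `LogNumBanks`, `ConflictFreeOffset`, `Padded` and `PaddedLength` are names chosen here for illustration and are not part of Cudafy.NET.

```csharp
using System;

public static class BankPadding
{
    // Assumption: 32 shared-memory banks, so log2(number of banks) = 5.
    public const int LogNumBanks = 5;

    // One padding slot is inserted after every 32 logical elements.
    public static int ConflictFreeOffset(int index)
    {
        return index >> LogNumBanks;
    }

    // Position of a logical index inside the padded shared array.
    public static int Padded(int index)
    {
        return index + ConflictFreeOffset(index);
    }

    // Length the shared array needs to hold n logical elements (n >= 1).
    public static int PaddedLength(int n)
    {
        return Padded(n - 1) + 1;
    }
}
```

Without padding, logical indices 31, 63 and 95 all map to bank 31 when there are 32 banks. With the padding above they move to positions 31, 64 and 97, which fall in banks 31, 0 and 1. In a kernel, every shared-memory index would go through the same mapping and the shared array would be allocated with the padded length.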

You can either port the code to CUDAfy or define regions of C and use them as unmanaged code from your application. However, the CUDA C code in the article contains several mistakes. Here is a corrected version written for Cudafy.NET:

[Cudafy]
public static void prescan(GThread thread, int[] g_odata, int[] g_idata, int[] n)
{
    // threadsPerBlock is a user-defined constant; temp is indexed up to n[0] - 1, so it must hold at least n[0] elements
    int[] temp = thread.AllocateShared<int>("temp", threadsPerBlock);
    int thid = thread.threadIdx.x;
    int offset = 1;
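    // Assumes a single block of n[0]/2 threads and n[0] a power of two,
    // so that every thread in the block reaches each SyncThreads() call.
    if (thid < n[0] / 2)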
    {
        temp[2 * thid] = g_idata[2 * thid]; // load input into shared memory  
        temp[2 * thid + 1] = g_idata[2 * thid + 1];

        for (int d = n[0] >> 1; d > 0; d >>= 1)                    // build sum in place up the tree  
        {
            thread.SyncThreads();
            if (thid < d)
            {
                int ai = offset * (2 * thid + 1) - 1;
                int bi = offset * (2 * thid + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }
        if (thid == 0)
        {
            temp[n[0] - 1] = 0; // clear the last element
        }

        for (int d = 1; d < n[0]; d *= 2) // traverse down tree & build scan  
        {
            offset >>= 1;
            thread.SyncThreads();
            if (thid < d)
            {
                int ai = offset * (2 * thid + 1) - 1;
                int bi = offset * (2 * thid + 2) - 1;
                int t = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }
        thread.SyncThreads();
        g_odata[2 * thid] = temp[2 * thid]; // write results to device memory  
        g_odata[2 * thid + 1] = temp[2 * thid + 1];
    }
}

You can use the above modified code instead of the one in the link.
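To check the kernel's output, it can help to have a CPU reference. The sketch below is plain C# with no Cudafy dependency: `ExclusiveScan` is a straightforward sequential exclusive scan, and `BlellochScan` replays the kernel's up-sweep and down-sweep serially, with the inner loop over `thid` standing in for the threads of one block. The class and method names are made up for this example, and the input length is assumed to be a power of two.

```csharp
using System;

public static class ScanReference
{
    // Sequential exclusive scan: output[i] is the sum of input[0..i-1].
    public static int[] ExclusiveScan(int[] input)
    {
        int[] output = new int[input.Length];
        int running = 0;
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = running;
            running += input[i];
        }
        return output;
    }

    // Serial replay of the kernel's two phases.
    // Assumption: input.Length is a power of two and at least 1.
    public static int[] BlellochScan(int[] input)
    {
        int n = input.Length;
        int[] temp = (int[])input.Clone();
        int offset = 1;

        // Up-sweep: build partial sums in place up the tree.
        for (int d = n >> 1; d > 0; d >>= 1)
        {
            for (int thid = 0; thid < d; thid++)
            {
                int ai = offset * (2 * thid + 1) - 1;
                int bi = offset * (2 * thid + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

        temp[n - 1] = 0; // clear the last element

        // Down-sweep: traverse down the tree and build the scan.
        for (int d = 1; d < n; d *= 2)
        {
            offset >>= 1;
            for (int thid = 0; thid < d; thid++)
            {
                int ai = offset * (2 * thid + 1) - 1;
                int bi = offset * (2 * thid + 2) - 1;
                int t = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }
        return temp;
    }
}
```

For the input `{ 3, 1, 7, 0, 4, 1, 6, 3 }` both methods return `{ 0, 3, 4, 11, 11, 15, 16, 22 }`. Within each tree level the threads touch disjoint pairs of elements, which is why replaying them one after another gives the same result as running them in parallel between barriers.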
