
I am having trouble understanding CUDA code for a naive prefix sum.

This code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html. In Example 39-1 (naive scan), we have code like this:

    __global__ void scan(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[]; // allocated on invocation
        int thid = threadIdx.x;
        int pout = 0, pin = 1;
        // Load input into shared memory.
        // This is exclusive scan, so shift right by one
        // and set first element to 0
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
        __syncthreads();

        for (int offset = 1; offset < n; offset *= 2)
        {
            pout = 1 - pout; // swap double buffer indices
            pin = 1 - pout;
            if (thid >= offset)
                temp[pout*n+thid] += temp[pin*n+thid - offset];
            else
                temp[pout*n+thid] = temp[pin*n+thid];
            __syncthreads();
        }
        g_odata[thid] = temp[pout*n+thid1]; // write output
    }

My questions are:

  1. Why do we need to create the shared-memory array temp?
  2. Why do we need the "pout" and "pin" variables? What do they do? Since we only use one block of at most 1024 threads here, can't we just use threadIdx.x to specify the element in the block?
  3. In CUDA, do we use one thread per add operation? Is it like each thread doing what one iteration of a for loop would do (the way OpenMP spreads loop iterations over threads, one thread per array element)?
  4. My previous two questions may seem naive... I think the key is that I don't understand the relation between the above implementation and the following pseudocode:

    for d = 1 to log2 n do
        for all k in parallel do
            if k >= 2^d then
                x[k] = x[k - 2^(d-1)] + x[k]

This is my first time using CUDA, so I'd appreciate it if anyone can answer my questions...

  • Why did I get a downvote? This is not fair. – noobie2023 Apr 08 '18 at 15:49
  • I don't understand why you are asking for a "good and detailed reference" when you have already linked to the GPU Gems article, which is the canonical reference. – talonmies Apr 08 '18 at 20:08
  • That reference doesn't have many comments in the code, and it doesn't answer the questions I asked. – noobie2023 Apr 08 '18 at 20:20
  • asking for references to off-site material is explicitly off-topic on stack overflow. See item 4 [here](https://stackoverflow.com/help/on-topic). – Robert Crovella Apr 08 '18 at 20:23
  • If this is your first time using CUDA, you should not be starting with prefix-sum. Get basic principles like vector add and stencil algorithms under your belt first. You should learn what shared memory is used for and what purposes it serves, before you try to tackle prefix-sum. The basic algorithm here is a ping-pong algorithm, successively moving data from a source buffer to a destination buffer, then doing it again in the reverse direction, until the work is complete. So you need temporary buffers of some sort. Shared memory is a logical choice for this. – Robert Crovella Apr 08 '18 at 20:30
  • If you can get that far, then the reason for `pout` and `pin` should be clear - I need a way to refer to and reverse the sense of the source and destination ping-pong buffers. The ping-pong concept should be fairly evident if you study the stepwise sequence taking place in Fig. 39-2 in the article you linked. – Robert Crovella Apr 08 '18 at 20:32
  • @RobertCrovella Thank you for providing a map for learning! I'm taking a CS course and I am "forced" to jump into CUDA right after learning basic MPI. The course is taught more like a review session, and my homework deadline is looming. I guess the only choice I have now is to sleep less... – noobie2023 Apr 08 '18 at 20:36
  • This code has a couple of bugs; it seems the authors never bothered to actually run it. First, it does not even compile, because of the `thid1` typo in the last line. Second, the in-place increment `temp[pout*n+thid] += temp[pin*n+thid - offset];` is incorrect, and should be written as `temp[pout*n+thid] = temp[pin*n+thid - offset] + temp[pin*n+thid];`. After fixing these bugs, the algorithm produces the correct output (for arrays whose size is a power of 2). – nth Jul 28 '23 at 13:47
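
For reference, here is the kernel with both of those fixes applied (a sketch based on the comment above; apart from the two fixes it is unchanged from the article, and it still assumes a single block and a power-of-2 n):

    __global__ void scan(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[]; // allocated on invocation
        int thid = threadIdx.x;
        int pout = 0, pin = 1;
        // Exclusive scan: shift right by one and set first element to 0.
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
        __syncthreads();

        for (int offset = 1; offset < n; offset *= 2)
        {
            pout = 1 - pout; // swap double buffer indices
            pin = 1 - pout;
            if (thid >= offset)
                // read both operands from the pin half to avoid the in-place bug
                temp[pout*n+thid] = temp[pin*n+thid - offset] + temp[pin*n+thid];
            else
                temp[pout*n+thid] = temp[pin*n+thid];
            __syncthreads();
        }
        g_odata[thid] = temp[pout*n+thid]; // write output (thid, not thid1)
    }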

1 Answer


1- It's much faster to load the data into shared memory and do the calculations there than to operate on global memory directly. It's important to synchronize the threads after loading shared memory, hence the __syncthreads().


2- These variables implement the double buffering ("ping-pong") between the two halves of temp: pout indexes the half being written in the current iteration, pin the half being read, and they swap roles every iteration:

    temp[pout*n+thid] += temp[pin*n+thid - offset];

First iteration: pout = 1 and pin = 0. Second iteration: pout = 0 and pin = 1. So on odd iterations the output half starts at offset n and the input half at 0, and on even iterations it's the other way around. To come back to your question: you can't achieve this with threadIdx.x alone, because it doesn't change within the loop, so it can't tell you which half of temp to read from and which to write to.
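
To make the ping-pong concrete, here is a small sequential C++ sketch of the same double-buffering logic (the buffer name and input values are my own, and it uses the corrected two-operand update from the comments under the question rather than the article's +=):

    #include <cstdio>

    int main() {
        const int n = 8;
        float in[n] = {3, 1, 7, 0, 4, 1, 6, 3};
        float buf[2][n];                  // the two halves of "temp"
        int pout = 0, pin = 1;

        // Exclusive scan: shift right by one, first element is 0.
        for (int k = 0; k < n; ++k)
            buf[pout][k] = (k > 0) ? in[k - 1] : 0;

        for (int offset = 1; offset < n; offset *= 2) {
            pout = 1 - pout;              // swap the roles of the two buffers
            pin = 1 - pout;
            for (int k = 0; k < n; ++k) { // this loop is what the threads do in parallel
                if (k >= offset)
                    buf[pout][k] = buf[pin][k - offset] + buf[pin][k];
                else
                    buf[pout][k] = buf[pin][k];
            }
        }

        for (int k = 0; k < n; ++k)
            printf("%g ", buf[pout][k]);  // prints: 0 3 4 11 11 15 16 22
        printf("\n");
        return 0;
    }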


3 & 4 - CUDA launches a grid of threads that all execute the kernel, so each thread runs that code on its own element. Comparing the pseudocode with the CUDA code: the inner "for all k in parallel" loop is what has been parallelized, with thread thid playing the role of k, so each thread performs one add per iteration of the outer loop over d (offset corresponds to 2^(d-1)). The outer loop stays sequential inside the kernel, and __syncthreads() makes every thread wait for the others at the end of each iteration before the final result is written to global memory.
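
As an aside, since temp is declared extern __shared__, its size is supplied at kernel launch, and it must hold both ping-pong halves, i.e. 2*n floats. A hypothetical launch (the variable names here are my own) would look like:

    int n = 1024;                           // one block; n must be a power of 2
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // ... copy the input into d_in ...
    // third launch parameter = dynamic shared memory size (both halves of temp)
    scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
    cudaDeviceSynchronize();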


Hope it helps.

  • The last line is g_odata[thid] = temp[pout*n+thid1];, but it should be g_odata[thid] = temp[pout*n+thid], right? So thid1 should be thid; is it an error? – Will May 26 '20 at 17:47