I am still a little unsure when it comes to shared/local memory in CUDA. Currently I have a kernel, and within the kernel each thread allocates a list object. Something like this:
__global__ void TestDynamicListPerThread()
{
    // Create a dynamic list (each thread gets its own list)
    DynamicList<int> dlist(15);

    // Display some output information
    printf("Allocated a new DynamicList, size=%d, got pointer %p\n",
           dlist.GetSizeInBytes(), dlist.GetDataPtr());

    // Loop through and insert multiples of four into the list
    for (int i = 0; i < 12; i++)
        dlist.InsertAtEnd(i * 4);
}
By my current understanding, each thread gets its own dlist stored in local memory. Is this true?
If that is the case, would there be any way at the end of the kernel's execution to grab each of the dlist objects (from another kernel), or should I be using a __shared__ array of dynamic lists allocated by the first thread?
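For what it's worth, the only alternative I can come up with is flattening everything into one cudaMalloc'd global buffer and giving each thread its own slot, something like the sketch below (LIST_CAPACITY, the kernel names, and the hard-coded size of 12 are made up for illustration; in my real code the values would come from each thread's DynamicList):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fixed upper bound on elements per thread-local list
#define LIST_CAPACITY 16

// Each thread copies its results into its own slot of a global buffer
__global__ void CreateLists(int *allLists, int *allSizes)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *mySlot = allLists + tid * LIST_CAPACITY;
    // In the real code these values would come from the thread's DynamicList
    for (int i = 0; i < 12; i++)
        mySlot[i] = i * 4;
    allSizes[tid] = 12;
}

// A later kernel can then read every thread's list back out of global memory
__global__ void ConsumeLists(const int *allLists, const int *allSizes)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int *mySlot = allLists + tid * LIST_CAPACITY;
    for (int i = 0; i < allSizes[tid]; i++)
        printf("thread %d, element %d: %d\n", tid, i, mySlot[i]);
}

int main()
{
    const int numThreads = 4;
    int *dLists, *dSizes;
    cudaMalloc(&dLists, numThreads * LIST_CAPACITY * sizeof(int));
    cudaMalloc(&dSizes, numThreads * sizeof(int));
    CreateLists<<<1, numThreads>>>(dLists, dSizes);
    ConsumeLists<<<1, numThreads>>>(dLists, dSizes);
    cudaDeviceSynchronize();
    cudaFree(dLists);
    cudaFree(dSizes);
    return 0;
}

At least that keeps the data in global memory, where a second kernel launch can see it, which (as far as I understand) neither local memory nor __shared__ memory would allow.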
I think I may be over-complicating things a little, but I never need to change the lists per se. The execution I am trying to achieve goes something like this:
1. Create the lists (done on the GPU only)
2. Produce output from each list (done on the GPU, by each thread; each thread needs only the information from the list allocated for it)
3. Modify/swap the lists (still done on the GPU)
4. Repeat steps 2 and 3 until some break condition is met on the host (rough host-side sketch below)
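In other words, on the host side I picture roughly the following loop (CreateLists, ProduceOutput, SwapLists, and CheckBreakCondition are all placeholder names; the real kernels would operate on the per-thread lists kept in a global buffer):

#include <cuda_runtime.h>

// Placeholder kernels standing in for steps 1-3
__global__ void CreateLists(int *dLists) { /* step 1: build the lists */ }
__global__ void ProduceOutput(const int *dLists) { /* step 2: per-thread output */ }
__global__ void SwapLists(int *dLists) { /* step 3: modify/swap */ }

// Placeholder host-side break test (step 4)
bool CheckBreakCondition(int iteration) { return iteration >= 10; }

int main()
{
    int *dLists;
    cudaMalloc(&dLists, 1024 * sizeof(int));

    CreateLists<<<1, 64>>>(dLists);            // step 1: GPU only
    for (int iter = 0; ; ++iter)
    {
        ProduceOutput<<<1, 64>>>(dLists);      // step 2
        SwapLists<<<1, 64>>>(dLists);          // step 3
        cudaDeviceSynchronize();               // lists never leave the GPU
        if (CheckBreakCondition(iter))         // step 4: host decides
            break;
    }
    cudaFree(dLists);
    return 0;
}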
Thanks in advance!