I am still a little unsure when it comes to shared/local memory in CUDA. Currently I have a kernel, and within the kernel each thread allocates a list object. Something like this:
__global__ void TestDynamicListPerThread()
{
    // Create a dynamic list (each thread gets its own list)
    DynamicList<int> dlist(15);

    // Display some output information
    printf("Allocated a new DynamicList, size=%d, got pointer %p\n",
           dlist.GetSizeInBytes(), dlist.GetDataPtr());

    // Loop through and insert multiples of four into the list
    for (int i = 0; i < 12; i++)
        dlist.InsertAtEnd(i * 4);
}
By my current understanding, each thread gets its own dlist stored in local memory. Is this true?
If that is the case, would there be any way at the end of the kernel's execution to grab each of the dlist objects (from another kernel), or should I be using a __shared__ array of dynamic lists allocated by the first thread?
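For what it's worth, the only alternative I can come up with is flattening everything into one cudaMalloc'd global buffer and giving each thread its own slot, something like the sketch below (LIST_CAPACITY, the kernel names, and the hard-coded size of 12 are made up for illustration; in my real code the values would come from each thread's DynamicList):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fixed upper bound on elements per thread-local list
#define LIST_CAPACITY 16

// Each thread copies its results into its own slot of a global buffer
__global__ void CreateLists(int *allLists, int *allSizes)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *mySlot = allLists + tid * LIST_CAPACITY;
    // In the real code these values would come from the thread's DynamicList
    for (int i = 0; i < 12; i++)
        mySlot[i] = i * 4;
    allSizes[tid] = 12;
}

// A later kernel can then read every thread's list back out of global memory
__global__ void ConsumeLists(const int *allLists, const int *allSizes)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int *mySlot = allLists + tid * LIST_CAPACITY;
    for (int i = 0; i < allSizes[tid]; i++)
        printf("thread %d, element %d: %d\n", tid, i, mySlot[i]);
}

int main()
{
    const int numThreads = 4;
    int *dLists, *dSizes;
    cudaMalloc(&dLists, numThreads * LIST_CAPACITY * sizeof(int));
    cudaMalloc(&dSizes, numThreads * sizeof(int));
    CreateLists<<<1, numThreads>>>(dLists, dSizes);
    ConsumeLists<<<1, numThreads>>>(dLists, dSizes);
    cudaDeviceSynchronize();
    cudaFree(dLists);
    cudaFree(dSizes);
    return 0;
}

At least that keeps the data in global memory, where a second kernel launch can see it, which (as far as I understand) neither local memory nor __shared__ memory would allow.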
I think I may be over-complicating things a little, but I never need to change the lists per se. The execution I am trying to achieve goes something like this:
1. Create the lists (done on the GPU only)
2. Produce output from each list (done on the GPU, by each thread; each thread needs only the information from the list allocated for it)
3. Modify/swap the lists (still done on the GPU)
4. Repeat steps 2 and 3 until some break condition is met on the host (rough host-side sketch below)
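In other words, on the host side I picture roughly the following loop (CreateLists, ProduceOutput, SwapLists, and CheckBreakCondition are all placeholder names; the real kernels would operate on the per-thread lists kept in a global buffer):

#include <cuda_runtime.h>

// Placeholder kernels standing in for steps 1-3
__global__ void CreateLists(int *dLists) { /* step 1: build the lists */ }
__global__ void ProduceOutput(const int *dLists) { /* step 2: per-thread output */ }
__global__ void SwapLists(int *dLists) { /* step 3: modify/swap */ }

// Placeholder host-side break test (step 4)
bool CheckBreakCondition(int iteration) { return iteration >= 10; }

int main()
{
    int *dLists;
    cudaMalloc(&dLists, 1024 * sizeof(int));

    CreateLists<<<1, 64>>>(dLists);            // step 1: GPU only
    for (int iter = 0; ; ++iter)
    {
        ProduceOutput<<<1, 64>>>(dLists);      // step 2
        SwapLists<<<1, 64>>>(dLists);          // step 3
        cudaDeviceSynchronize();               // lists never leave the GPU
        if (CheckBreakCondition(iter))         // step 4: host decides
            break;
    }
    cudaFree(dLists);
    return 0;
}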
Thanks in advance!