Preliminaries
Usually, people are interested in GPU computing for performance reasons - to make their codes run faster. So we'll keep performance as a guide when trying to make decisions about what to do.
I think one of the problems with the sketch you provided in your question is the natural alignment requirement in CUDA. Picking an arbitrary pointer and type-casting it to a different type may run afoul of this. If you have such a problem in your code, the cuda-memcheck tool should be able to uncover it.
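To make the alignment hazard concrete, here is a minimal sketch (the kernel name and the byte offset are hypothetical, chosen only for illustration): casting a char pointer at an offset that is not a multiple of the target type's size produces a misaligned access, which cuda-memcheck should flag.

// minimal illustration of the alignment hazard (hypothetical kernel, not from your code)
__global__ void misaligned_example(char *buf) {
  // buf from cudaMalloc is at least 256-byte aligned, so buf+1 is not 8-byte aligned;
  // this double store violates natural alignment and cuda-memcheck should report it
  double *d = (double *)(buf + 1);
  *d = 1.0;
}

int main(){
  char *buf;
  cudaMalloc(&buf, 64);
  misaligned_example<<<1,1>>>(buf);
  cudaDeviceSynchronize();
}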
The typical place to put thread-private arrays in C++ is local memory, and CUDA is no different in my opinion. However CUDA C++, at least, does not support variable-length arrays. In your question, you sketched out using shared memory as a proxy for this. One of the implications of your thinking (I assume) is that although the sizes of these arrays are not known at compile time, there must be an upper bound on their size, because shared memory may impose a limit as low as 48KB per threadblock. Therefore if you have 1024 threads in a threadblock, the maximum combined array size per thread would be limited to 48 bytes. With 512 threads per block you could conceivably have 96 bytes per thread. These limits would follow from the shared memory limit, if you used shared memory.
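Spelled out as arithmetic (48KB being the conservative per-block shared memory limit; some GPUs allow opting into more):

const int smem_per_block = 48 * 1024;             // conservative shared memory limit per threadblock
const int budget_1024t   = smem_per_block / 1024; // = 48 bytes of combined array space per thread
const int budget_512t    = smem_per_block / 512;  // = 96 bytes of combined array space per thread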
So an alternate approach (if you can adhere to these low limits) would be to simply upper-bound the local memory needed, and statically define a local memory array of that size (or 3 such arrays), per thread. A single array would have to be partitioned among the various logical arrays, with attention paid to alignment as already mentioned. But given the small sizes suggested by your approach (e.g. ~96 bytes total) it would probably be expedient just to use upper-bounded fixed-size local arrays (not shared memory).
Local memory in CUDA is ultimately backed by the same physical resource -- GPU DRAM memory -- as global memory. However the arrangement is such that if each thread is accessing a particular element in their own local memory, the effect across threads would be equivalent to coalesced access, should that access need to be serviced by DRAM. This means the per-thread local storage is interleaved, in some fashion. And this interleaving characteristic is also something we will want to pay attention to, for performance reasons, if we come up with our own variable-length array implementation. It applies equally to a global memory proxy (to enable coalescing) or a shared memory proxy (to avoid bank conflicts).
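To make that concrete, here is a small sketch contrasting interleaved indexing with a naive contiguous per-thread layout; the helper names and the float element type are arbitrary choices for illustration, not something from your question:

// interleaved layout: element i of thread tid lives at i*stride + tid, so at any
// instant adjacent threads touch adjacent addresses (coalescing / bank-conflict friendly)
__device__ float read_interleaved(const float *arr, int i, int tid, int stride) {
  return arr[(size_t)i * stride + tid];
}

// contiguous per-thread layout: element i of thread tid lives at tid*len + i,
// so simultaneous accesses by adjacent threads are scattered (poorly coalesced)
__device__ float read_contiguous(const float *arr, int i, int tid, int len) {
  return arr[(size_t)tid * len + i];
}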
In addition to the desire to interleave access for performance reasons, another performance-related reason not to prefer a shared memory implementation is that extensive use of shared memory can have negative implications for occupancy, and therefore for performance. This topic is covered in many other places, so I'll not drill into it further here.
Implementations
Local Memory
As mentioned above, I believe one of the implicit assumptions behind your suggestion to use shared memory is that there must be some (reasonably small) upper bound to the actual sizes of the arrays needed. If that is the case, it may be expedient to use 3 arrays, each sized to its upper bound:
// example element types; these match the ones used in the later examples
typedef unsigned type1;
typedef char type2;
typedef double type3;

const int Max_aLen = 9;
const int Max_bLen = 5;
const int Max_cLen = 9;

__global__ void mykernel() {
  type1 a[Max_aLen];
  type2 b[Max_bLen];
  type3 c[Max_cLen];
  ...
}
Using up to e.g. 8KB per thread of local memory should not be a major concern, in my opinion, but it may depend on your GPU and its memory size, and an analysis of the kernel's actual local memory usage (see the sketch below) should indicate any problems. Certainly the low limits discussed above, e.g. ~96 bytes per thread, should not be an issue.
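As one way to do that analysis, here is a hedged sketch: cudaFuncGetAttributes reports the per-thread local memory footprint the compiler assigned to a kernel. It assumes the local-memory kernel mykernel above is defined in the same file; error checking is omitted for brevity.

#include <stdio.h>

int main(){
  cudaFuncAttributes attr;
  // query compiler-assigned resource usage for the kernel defined above
  cudaFuncGetAttributes(&attr, mykernel);
  printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
  printf("registers per thread:    %d\n", attr.numRegs);
}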
Global Memory
I believe the simplest and most flexible approach would be to provide storage for such variable-length arrays via global memory, with pointers passed to the kernel. This allows us to allocate storage for each array via e.g. cudaMalloc, we can handle separate arrays separately, and we need to pay relatively little attention to alignment requirements. Since we are pretending these global arrays will be used as if they were thread-private, we will want to arrange our indexing to create interleaved storage/access per thread, which will facilitate coalescing. It could look like this for your 3-array example:
#include <stdio.h>
typedef unsigned type1;
typedef char type2;
typedef double type3;
__global__ void mykernel(type1 *a, type2 *b, type3 *c) {
size_t stride = (size_t)gridDim.x * blockDim.x;
size_t idx = (size_t)blockIdx.x*blockDim.x+threadIdx.x;
a[7*stride+idx] = 4; // "local" access to a
b[0*stride+idx] = '0'; // "local" access to b
c[3*stride+idx] = 1.0; // "local" access to c
}
int main(){
// 1D example
type1 *d_a;
type2 *d_b;
type3 *d_c;
// some arbitrary choices to be made at run-time
size_t alen = 27;
size_t blen = 55;
size_t clen = 99;
int nTPB = 256;
int nBLK = 768;
size_t grid = (size_t)nBLK*nTPB;
// allocate
cudaMalloc(&d_a, alen*grid*sizeof(type1));
cudaMalloc(&d_b, blen*grid*sizeof(type2));
cudaMalloc(&d_c, clen*grid*sizeof(type3));
// launch
mykernel<<<nBLK, nTPB>>>(d_a, d_b, d_c);
cudaDeviceSynchronize();
}
A possible criticism of this approach is that it may consume more device memory than the local memory approach would (it might also consume less, depending on grid size relative to GPU type). However this could be managed by limiting the grid size via an approach such as grid-stride looping, as sketched below.
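Here is a minimal sketch of that idea, reusing the example types and the "local" indexing from above; the kernel name mykernel_gs, the work count N, and the loop structure are illustrative additions, not something from your question:

typedef unsigned type1; // same example types as above
typedef char type2;
typedef double type3;

// grid-stride loop variant: the allocations are sized by the (deliberately limited)
// grid, not by the total work count N
__global__ void mykernel_gs(type1 *a, type2 *b, type3 *c, size_t N) {
  size_t stride = (size_t)gridDim.x * blockDim.x;
  size_t idx = (size_t)blockIdx.x*blockDim.x + threadIdx.x;
  for (size_t i = idx; i < N; i += stride) {
    // the per-thread "private" slots are indexed by idx, not i:
    // each loop iteration reuses the same slots
    a[7*stride+idx] = 4;   // "local" access to a
    b[0*stride+idx] = '0'; // "local" access to b
    c[3*stride+idx] = 1.0; // "local" access to c
  }
}

With this arrangement the allocations stay at alen*grid*sizeof(type1) (and similarly for b and c), where grid can be capped at, say, a small multiple of the number of SMs, independent of how large N is.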
Shared Memory
Since dynamically allocated shared memory gives us only a single base pointer, if we do something with shared memory we're going to have to pay careful attention to alignment. Here is an example of the type of calculation needed to allocate and position properly aligned pointers:
#include <stdio.h>
typedef unsigned type1;
typedef char type2;
typedef double type3;
__global__ void mykernel(int b_round_up, int c_round_up) {
extern __shared__ char sdata[];
type1 *a = (type1 *)sdata;
type2 *b = (type2 *)(sdata + b_round_up);
type3 *c = (type3 *)(sdata + c_round_up);
size_t stride = blockDim.x;
size_t idx = threadIdx.x;
a[7*stride+idx] = 4; // "local" access to a
b[0*stride+idx] = '0'; // "local" access to b
c[3*stride+idx] = 1.0; // "local" access to c
}
int main(){
// 1D example
// some arbitrary choices to be made at run-time
int alen = 9;
int blen = 5;
int clen = 9;
int nTPB = 256;
int nBLK = 1;
// calculate aligned shared mem offsets
int b_round_up = (((nTPB*alen*sizeof(type1) + sizeof(type2)-1)/sizeof(type2))*sizeof(type2)); // round up
int c_round_up = (((b_round_up + nTPB*blen*sizeof(type2) + sizeof(type3)-1)/sizeof(type3))*sizeof(type3)); // round up
// allocate + launch
mykernel<<<nBLK, nTPB, c_round_up + nTPB*clen*sizeof(type3)>>>(b_round_up,c_round_up);
cudaDeviceSynchronize();
}
I'm not suggesting that any of my codes are defect-free, but you can see, from a relative code-complexity standpoint, that the Local or Global options would be preferred. I'm not readily able to imagine reasons or cases where the shared memory implementation would be preferred.