
Each instance of my CUDA kernel (i.e. each thread) needs three private arrays, with different types.

e.g.

__global__ void mykernel() {
    type1 a[aLen];
    type2 b[bLen];
    type3 c[cLen];

    ...
}

The sizes of these types are not known until compile time, and the lengths aLen, bLen and cLen are determined at run time.

Of course I must allocate a single instance of shared memory for the entire block.

void caller() {
    int threadsPerCUDABlock = ...
    int CUDABlocks = ...

    int threadMemSize = 
        aLen*sizeof(type1) + bLen*sizeof(type2) + cLen*sizeof(type3);

    int blockMemSize = threadsPerCUDABlock * threadMemSize;

    mykernel <<< CUDABlocks, threadsPerCUDABlock, blockMemSize >>>();
}

It's then the task for each thread to work out which partition of the shared memory is its private memory space, and how to delimit that into the sub-arrays of the 3 types. In this instance, I organise the shared memory array to have structure:

[ thread0_a, thread0_b, thread0_c,  thread1_a, ...]

I'm unsure how best to approach unpacking this structure in the kernel. I've experimented with passing in the number of bytes of each thread's private space and initially treating the memory as a 1-byte type like char:

mykernel <<< CUDABlocks, threadsPerCUDABlock, blockMemSize >>>(threadMemSize);
__global__ void mykernel(int threadMemSize) {

    extern __shared__ char sharedMem[];

    char* threadMem = &sharedMem[threadMemSize*threadIdx.x];
    type1 *a = (type1*) threadMem;
    type2 *b = (type2*) &a[aLen];
    type3 *c = (type3*) &b[bLen];

    ...
}

This hasn't worked (though it produces no errors, which makes it hard to debug), and I'm not convinced it should even work in principle. I can't guarantee, for example, that the sizes of type1, type2 and type3 strictly decrease.

So what is the correct paradigm for doing this in general? That is, the established way for unpacking multiple per-thread arrays of varying type and size?

Anti Earth
    certainly one problem with your approach is the requirement in CUDA for natural alignment. I suspect if you violated this, that such a violation would be discoverable by running your code with `cuda-memcheck`. Not sure what you mean by "without any errors". Perhaps you are not checking for errors properly. I'm not sure that use of shared memory this way is what I would suggest. Even if you get it working, there is a possibility of bank conflicts. If you need a variable size scratchpad, my suggestion would be to use global memory, but arrange storage pattern for coalesced access. – Robert Crovella Jun 12 '19 at 14:27
  • You could also do something in shared memory in a roughly equivalent fashion, while attempting to avoid bank conflicts, with interleaved storage. And I'm not sure there is a widely acknowledged "correct paradigm" for this. – Robert Crovella Jun 12 '19 at 14:31
  • Are you able to provide a quick code snippet for this (or a link to an example)? – Anti Earth Jun 12 '19 at 14:50
  • Shared memory is much slower than register. Store your arrays in global memory, read in coalesced manner and use register shuffling for sharing purpose. That's much faster this way – Oblivion Jun 12 '19 at 15:27
  • This sounds like a spectacularly poor idea, even if it could be made to work – talonmies Jun 12 '19 at 17:42
  • Global functions can be templates. – Regis Portalez Jun 17 '19 at 07:43

2 Answers


Preliminaries

Usually, people are interested in GPU computing for performance reasons - to make their codes run faster. So we'll keep performance as a guide when trying to make decisions about what to do.

I think one of the problems the sketch you provided in your question would have is one of natural alignment requirement in CUDA. Picking an arbitrary pointer and type-casting it to a different type may run afoul of this. If you have such a problem in your code, the cuda-memcheck tool should be able to uncover it.
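For illustration, here is a minimal sketch of the round-up arithmetic that keeps each sub-array naturally aligned within a single byte buffer (align_up is just an illustrative helper name, not part of any CUDA API):

#include <cstddef>

// Illustrative helper: round a byte offset up to the next multiple of
// `alignment`, so a sub-array placed at the returned offset satisfies the
// natural alignment of its element type.
__host__ __device__ inline size_t align_up(size_t offset, size_t alignment) {
  return ((offset + alignment - 1) / alignment) * alignment;
}

// usage sketch: size_t c_offset = align_up(bytes_used_so_far, alignof(type3));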

The typical place to put thread-private arrays in C++ is local memory, and CUDA is no different in my opinion. However CUDA C++, at least, does not support variable-length arrays. In your question, you sketched out using shared memory as a proxy for this. One of the implications of your thinking (I assume) is that although the sizes of these arrays are not known at compile time, there must be an upper bound on their size, because shared memory may impose a limit of as low as 48KB per threadblock. Therefore if you have 1024 threads in a threadblock, the maximum combined array size per thread would be limited to 48 bytes. With 512 threads per block you could conceivably have 96 bytes per thread. These would be due to shared memory limits if you used shared memory.

So an alternate approach (if you can adhere to these low limits) would be to simply upper-bound the local memory needed, and statically define a local memory array of that size (or three of them) per thread. A single array would have to be partitioned among the various arrays, with attention paid to alignment as already mentioned. But given the small sizes suggested by your approach (e.g. ~96 bytes total), it would probably be expedient just to use upper-bounded fixed-size local arrays (not shared memory).

Local memory in CUDA is ultimately backed by the same physical resource -- GPU DRAM memory -- as global memory. However the arrangement is such that if each thread is accessing a particular element in their own local memory, the effect across threads would be equivalent to coalesced access, should that access need to be serviced by DRAM. This means the per-thread local storage is interleaved, in some fashion. And this interleaving characteristic is also something we will want to pay attention to, for performance reasons, if we come up with our own variable-length array implementation. It applies equally to a global memory proxy (to enable coalescing) or a shared memory proxy (to avoid bank conflicts).

In addition to the desire to interleave access for performance reasons, a possible performance reason to not prefer a shared memory implementation is that extensive use of shared memory can have negative implications for occupancy, and therefore for performance. This topic is covered in many other places, so I'll not drill into it further here.

Implementations

Local Memory

As mentioned above, I believe one of the implicit assumptions about your suggestion to use shared memory is that there must be some (reasonably small) upper bound to the actual sizes of the arrays needed. If that is the case, it may be expedient to use 3 arrays allocated with the upper bound size:

const int Max_aLen = 9;
const int Max_bLen = 5;
const int Max_cLen = 9;
__global__ void mykernel() {
    type1 a[Max_aLen];
    type2 b[Max_bLen];
    type3 c[Max_cLen];

    ...
}

Using up to e.g. 8 kilobytes per thread of local memory should not be a major concern, in my opinion, but it may depend on your GPU and memory size, and the analysis mentioned below should indicate any problems. Certainly low amounts such as the ~96 bytes per thread mentioned above should not be an issue.
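If you want to check, here is a minimal sketch (assuming the fixed-size-array kernel above is named mykernel) of querying how much local memory the compiler actually assigned per thread:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: report the per-thread local memory footprint the compiler chose for
// a kernel (assumed here to be the fixed-size-array mykernel above).
// Similar information is printed at compile time with nvcc -Xptxas -v.
void report_local_mem() {
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, mykernel);
  printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
}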

Global Memory

I believe the simplest and most flexible approach would be to provide storage for such variable-length arrays via global memory, with pointers passed to the kernel. This allows us to allocate storage for each array via e.g. cudaMalloc, we can handle the arrays separately, and we need to pay relatively little attention to alignment requirements. Since we are pretending these global arrays will be used as if they were thread-private, we will want to arrange our indexing to create interleaved storage/access per thread, which will facilitate coalescing. It could look like this for your 3-array example:

#include <stdio.h>

typedef unsigned type1;
typedef char     type2;
typedef double   type3;

__global__ void mykernel(type1 *a, type2 *b, type3 *c) {

  size_t stride = (size_t)gridDim.x * blockDim.x;
  size_t idx = (size_t)blockIdx.x*blockDim.x+threadIdx.x;
  a[7*stride+idx] = 4;    // "local"  access to a
  b[0*stride+idx] = '0';  // "local"  access to b
  c[3*stride+idx] = 1.0;  // "local"  access to c
}

int main(){
  // 1D example
  type1 *d_a;
  type2 *d_b;
  type3 *d_c;
  // some arbitrary choices to be made at run-time
  size_t alen = 27;
  size_t blen = 55;
  size_t clen = 99;
  int nTPB = 256;
  int nBLK = 768;
  size_t grid = (size_t)nBLK*nTPB;
  // allocate
  cudaMalloc(&d_a, alen*grid*sizeof(type1));
  cudaMalloc(&d_b, blen*grid*sizeof(type2));
  cudaMalloc(&d_c, clen*grid*sizeof(type3));
  // launch
  mykernel<<<nBLK, nTPB>>>(d_a, d_b, d_c);
  cudaDeviceSynchronize();
}

A possible criticism of this approach is that it may consume more device memory than the local memory approach would (it might also consume less, depending on grid size relative to GPU type). However this can be managed by limiting the grid size via an approach such as grid-stride looping, as sketched below.
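To illustrate (a sketch only, reusing the typedefs above; the wrapper name and element count N are hypothetical): keep the grid at a fixed size, size the interleaved scratch arrays for that grid rather than for the problem, and let each thread reuse its slot for every work item it processes.

__global__ void mykernel_gridstride(type1 *a, type2 *b, type3 *c, size_t N) {

  size_t stride = (size_t)gridDim.x * blockDim.x;           // total threads in grid
  size_t idx = (size_t)blockIdx.x*blockDim.x+threadIdx.x;
  for (size_t i = idx; i < N; i += stride) {                // grid-stride loop
    a[7*stride+idx] = 4;    // "local" scratch, keyed by idx, reused for each i
    b[0*stride+idx] = '0';
    c[3*stride+idx] = 1.0;
    // ... process work item i using a, b, c ...
  }
}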

Shared Memory

Since we only have one pointer to shared memory for dynamically allocated shared memory, if we do something with shared memory, we're going to have to pay careful attention to alignment. Here is an example of the type of calculations that are needed to allocate and position properly aligned pointers:

#include <stdio.h>

typedef unsigned type1;
typedef char     type2;
typedef double   type3;

__global__ void mykernel(int b_round_up, int c_round_up) {

  extern __shared__ char sdata[];
  type1 *a = (type1 *)sdata;
  type2 *b = (type2 *)(sdata + b_round_up);
  type3 *c = (type3 *)(sdata + c_round_up);
  size_t stride = blockDim.x;
  size_t idx = threadIdx.x;
  a[7*stride+idx] = 4;    // "local"  access to a
  b[0*stride+idx] = '0';  // "local"  access to b
  c[3*stride+idx] = 1.0;  // "local"  access to c
}

int main(){
  // 1D example
  // some arbitrary choices to be made at run-time
  int alen = 9;
  int blen = 5;
  int clen = 9;
  int nTPB = 256;
  int nBLK = 1;
  // calculate aligned shared mem offsets
  int b_round_up = (((nTPB*alen*sizeof(type1) + sizeof(type2)-1)/sizeof(type2))*sizeof(type2)); // round up
  int c_round_up = (((b_round_up + nTPB*blen*sizeof(type2) + sizeof(type3)-1)/sizeof(type3))*sizeof(type3)); // round up
  // allocate + launch
  mykernel<<<nBLK, nTPB, c_round_up + nTPB*clen*sizeof(type3)>>>(b_round_up,c_round_up);
  cudaDeviceSynchronize();
}

I'm not suggesting that any of my codes are defect-free, but you can see from the relative code complexity standpoint, that the Local or Global options would be preferred. I'm not readily able to imagine reasons or cases where the shared memory implementation would be preferred.

Robert Crovella
  • This was a really fantastic explanation, thanks very much! Indeed I cannot upper-bound the memory per thread, so will have to use global memory. – Anti Earth Jun 13 '19 at 09:24

Each instance of my CUDA kernel (i.e. each thread)

Threads are not instances of a kernel. Threads are part of a block, blocks form a grid, a grid runs a kernel function.

Each [thread] of my CUDA kernel needs three private arrays

Does it though? I doubt it. I'm guessing your computational problem can be reformulated so that many threads cooperate and work on a single such triplet of arrays (or perhaps several such triplets).

It's then the task for each thread to work out which partition of the shared memory is its private memory space

Not necessarily. Even if you insist on your 3 private arrays, you could place them in "local memory" (which is really just thread-private global memory). If each thread uses only a small amount of local memory, it may all fit in the L2 cache, which, while at best slower than shared memory, sometimes makes sense for other reasons (e.g. avoiding shared memory bank conflicts).

Alternatively, and in the case that the overall size of your small arrays is really small, you could consider keeping them in registers. That means you can't index into them with run-time-computed indices (a very restrictive condition), but registers are super fast and there are lots of them - more register space than shared memory, for example.
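For instance, a minimal sketch (arbitrary small size, hypothetical kernel name): a fixed-size per-thread array that is indexed only through fully unrolled loops can usually be kept entirely in registers:

// Sketch: both loops are fully unrolled, so every index into a[] is known at
// compile time and the compiler can keep the array in registers instead of
// local memory.
__global__ void regs_kernel(float *out) {
  float a[4];
  #pragma unroll
  for (int i = 0; i < 4; ++i)
    a[i] = i * 2.0f;
  float s = 0.0f;
  #pragma unroll
  for (int i = 0; i < 4; ++i)
    s += a[i];
  out[(size_t)blockIdx.x * blockDim.x + threadIdx.x] = s;
}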

Whatever memory space you choose, always measure, and use the profiler to determine whether that's your bottleneck, whether it impacts occupancy, effective use of the GPU cores' functional units, and so on. And if you're not happy with what you get, try one of the other options.

I organise the shared memory array to have structure:

[ thread0_a, thread0_b, thread0_c,  thread1_a, ...]

Yeah... that might not be a good choice. You see, shared memory is arranged in banks; if your warp's lanes (threads in a warp) try to access data in the same bank, those accesses get serialized. For example, suppose the size of each thread's region is a multiple of 128 bytes. If all threads in a warp start their work by accessing a[0] (which very often happens), they will all be trying to access the same bank, and you'll get a 32x slowdown.

If the lanes in a warp tend to access the same indices in the arrays, it is a better idea to interlace the arrays, i.e. use the following arrangement (using your way of illustrating it):

[ thread_0_a[0], thread_1_a[0], thread_2_a[0], ... thread_n_a[0], thread_0_a[1], ... ]

This also has the added benefit that all you need to know is the maximum length of a thread's private array and the number of threads to determine exactly where each thread's arrays start. On the other hand, it means you can "pack" fewer arrays than you otherwise might. But that's not too bad! Use fewer warps per block and you should still be ok. A concrete sketch follows below.
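A minimal sketch of the interlaced indexing (with a hypothetical float array of length maxLen): element i of thread t's array lives at sdata[i * blockDim.x + t], so when all lanes of a warp touch the same index i they hit consecutive banks.

// Sketch of the interlaced layout: thread-private element i sits at
// sdata[i * blockDim.x + threadIdx.x]; same-index accesses across a warp land
// in consecutive banks (conflict-free for 4-byte elements).
__global__ void interlaced_kernel(int maxLen) {
  extern __shared__ float sdata[];                  // blockDim.x * maxLen floats
  for (int i = 0; i < maxLen; ++i)
    sdata[i * blockDim.x + threadIdx.x] = (float)i; // this thread's a[i]
  // ... use the per-thread array ...
}
// launch: interlaced_kernel<<<blocks, threads, threads * maxLen * sizeof(float)>>>(maxLen);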

I should mention that @RobertCrovella's answer makes a similar point.

Caveat: Note I did say "if" at the beginning of this part of the answer. It may be the case that your threads' access pattern is different. If it is, interlacing may not help you. Again, it is probably a good idea to profile and measure to check.


Since my answer has suggested a more far-reaching change (and since I don't have the time) I won't go into more specific details. Feel free to comment if I've been unclear somewhere.

einpoklum