
Here is a 3D kernel that I currently run on one block:

// The two following variables are set elsewhere in the program.
// I give them possible values here for demonstration purposes.
int N[] = {14, 5, 1};
int L = 2; // N's size - 1

int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.x * blockDim.y + threadIdx.y;
int idz = blockIdx.x * blockDim.z + threadIdx.z;

int idxInc = idx + 1; // so that threads with idx = 0 are not wasted
if (idxInc >= 1 && idxInc <= L)
{
    if (idy < N[idxInc])
    {       
        if (idz < N[idxInc-1])
        {
            dw[ idxInc ][ idy ][ idz ] = 0;
        }
    }
}

If I launch this kernel on one block whose dimensions are {2, 5, 14}, everything is alright. That is exactly the number of threads needed in each dimension of the block for the kernel to do the job on the data as defined in the first two lines. Now, I don't see how to divide this work among multiple blocks. My brain simply locks up trying to find the right number of threads for each dimension over two blocks. Moreover, L may vary (but I might put a restriction on this), and more likely N[1] will vary a lot (it's 5 in this example, but could be 128, 256, or 2048...). So I have to find an algorithm that automatically balances the number of blocks and the number of threads in each of the three dimensions of a block.

I really don't see how to do it, and now I feel stupid! I'm beginning to think I should just stop playing with 3 dimensions... Or maybe there's a simple trick I just can't see...

Some help? Thank you!

Edit: to check the result serially...

for (int layer = 1; layer <= L; layer++)
{
    for (int i = 0; i < N[layer]; i++)
    {
        for (int j = 0; j < N[layer-1]; j++)
        {
            printf("%1.0f", dw[ layer ][ i ][ j ]);
        }
        printf("\n");
    }
    printf("\n");
}

Every number displayed should be 0.

Yugo Amaryl
    Are you saying the total scope of your problem is defined by the {2, 5, 14} threadblock structure? If so that's a tiny little problem. The way to extend this is to propose a much larger data set. In that case, most codes I have seen simply pick some convenient threadblock size, like {8, 8, 8} and divide up the entire 3D domain into blocks of that size, each thread operating on one element. Just like in the 1D case, we should have a thread check in our kernel code which checks the thread indices against valid data indices. You'll have some wasted threads on the edges of the domain; its OK. – Robert Crovella May 23 '13 at 23:00
  • Here's a [complete code example](http://stackoverflow.com/questions/14920931/3d-cuda-kernel-indexing-for-image-filtering/14926201#14926201) of what I describe above. – Robert Crovella May 23 '13 at 23:02
  • `N = {2, 5, 14}` is just an example I use because I find it easier to think about small data. It's still my neural network that I'm trying to parallelize, if you remember... The central value, 5, is the number of neurons of the hidden layer. It is likely to be set to 128, or 1024. The user has to choose this value. But the number of layers, 2, is unlikely to change. And the number of neurons of the input layer, 14, will never change. So if I pick a symmetric block for such a set of data, {2, 1024, 14}, it might result in a lot of wasted threads... – Yugo Amaryl May 23 '13 at 23:34
  • Anyway, my code does not work, even for a symmetric block. If I put 128 neurons in the hidden layer -> N = {14, 128, 1}, and choose your convenient block of {8,8,8}, I need to launch 16 blocks. The result is only a little cube of 0s whose size is... 8x8x8! And the rest is not initialized. I guess it has something to do with the thread id computation. I'm searching, but my brain is broken... – Yugo Amaryl May 23 '13 at 23:44
  • This -> "if (idx==127 && idy == 127 && idz == 127) printf("HELLO\n");" is OK; it is printed. But this -> "if (idx==0 && idy == 127 && idz == 13) printf("BONJOUR\n");" for some reason does not print! Why is there a thread {127,127,127} but no thread {0, 127, 13}??? It's driving me crazy! – Yugo Amaryl May 24 '13 at 01:07
  • If your domain is fixed at {2, xxx, 14} then I probably wouldn't go with {8, 8, 8} blocks. Maybe a 1D grid of 3D blocks like {2, 16, 14} or something like that (total of 64 blocks for this example assuming xxx=1024). – Robert Crovella May 24 '13 at 04:28
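The symptom described above — thread {127,127,127} prints but {0, 127, 13} never does, and only 8x8x8 patches get initialized — is consistent with the question's index computation: blockIdx.x is multiplied into all three dimensions, so a 1D launch of 16 blocks only covers the diagonal of the domain. Here is a minimal sketch of the per-dimension indexing and ceiling-division sizing the comments converge on; maxWidth and maxDepth are hypothetical names for the largest y and z extents (the max of N[1..L] and of N[0..L-1]):

    // Sketch under the comments' assumptions: N = {14, 128, 1}, L = 2,
    // and Robert Crovella's suggested {2, 16, 14} block shape.
    int L = 2;
    int maxWidth = 128;  // max of N[1..L]   (y extent)
    int maxDepth = 14;   // max of N[0..L-1] (z extent)
    dim3 block(2, 16, 14);                          // 448 threads, a multiple of 32
    dim3 grid((L        + block.x - 1) / block.x,   // ceiling division so partial
              (maxWidth + block.y - 1) / block.y,   // blocks cover any remainder
              (maxDepth + block.z - 1) / block.z);  // -> grid {1, 8, 1}
    // In the kernel, each dimension must use its own block index:
    //   int idx = blockIdx.x * blockDim.x + threadIdx.x;
    //   int idy = blockIdx.y * blockDim.y + threadIdx.y;
    //   int idz = blockIdx.z * blockDim.z + threadIdx.z;
    // followed by a bounds check against the real extents before touching dw.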

1 Answer


Here's a simple code sample along the lines (I think) of what you are describing:

#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// for simplicity assume grid is an even multiple of blocksizes
#define XSIZE 1024
#define YSIZE 14
#define ZSIZE 2
#define TSIZE (XSIZE*YSIZE*ZSIZE)
#define BLKX 16
#define BLKY 14
#define BLKZ 2


#define IDX(x,y,z) (((z)*(XSIZE*YSIZE))+((y)*XSIZE)+(x))

typedef float mytype;

__global__ void mykernel(mytype *data){

  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  int idy = threadIdx.y + blockDim.y*blockIdx.y;
  int idz = threadIdx.z + blockDim.z*blockIdx.z;

  if ((idx < XSIZE)&&(idy < YSIZE)&&(idz < ZSIZE))
    data[IDX(idx,idy,idz)] = (mytype)idx;
  if ((idx==127)&&(idy==13)&&(idz==1)) printf("BONJOUR\n");
}

int main(){


// for simplicity assume grid is an even multiple of blocksizes
  dim3 block(BLKX, BLKY, BLKZ);
  dim3 grid(XSIZE/BLKX, YSIZE/BLKY, ZSIZE/BLKZ);

  mytype *h_data, *d_data;

  h_data=(mytype *)malloc(TSIZE*sizeof(mytype));
  if (h_data == 0) {printf("malloc fail\n"); return 1;}
  cudaMalloc((void **)&d_data, TSIZE*sizeof(mytype));
  cudaCheckErrors("cudaMalloc fail");

  for (int x=0; x<XSIZE; x++)
    for (int y=0; y<YSIZE; y++)
      for (int z=0; z<ZSIZE; z++)
        h_data[IDX(x,y,z)] = (mytype)0;

  cudaMemcpy(d_data, h_data, TSIZE*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy fail");

  mykernel<<<grid, block>>>(d_data);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");

  cudaMemcpy(h_data, d_data, TSIZE*sizeof(mytype), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy fail");


  for (int x=0; x<XSIZE; x++)
    for (int y=0; y<YSIZE; y++)
      for (int z=0; z<ZSIZE; z++)
        if(h_data[IDX(x,y,z)] != (mytype)x) {printf("data check fail at (x,y,z) = (%d, %d, %d), was: %f, should be: %f\n", x, y, z, h_data[IDX(x,y,z)], (mytype)x); return 1;}
  printf("Data check passed!\n");


  return 0;
}

compile with:

nvcc -arch=sm_20 -o t159 t159.cu
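(Note: recent CUDA toolkits have dropped `sm_20`; if you try this code today, substitute a compute capability your GPU actually supports — `sm_52` here is only an example:)

    nvcc -arch=sm_52 -o t159 t159.cu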

when I run it I get:

BONJOUR
Data check passed!
Robert Crovella
  • Very nice of you! Ok, let me ask you about several points, because I'm learning from every line you write! First, I didn't know one could launch a 3D grid. Now the id computation makes sense and everything is initialized! I just added the use of ceil for the grid dimension calculation: (ceil(XSIZE/BLKX), ceil(YSIZE/BLKY), ceil(ZSIZE/BLKZ)). Next, you use a contiguous vector instead of a 3D array. I'm told this is faster. I guess accessing an element of a 3D array implies 3 memory accesses, which is slower than your one absolute-index calculation and one memory access. Is it something like that? – Yugo Amaryl May 24 '13 at 12:44
  • I'm also told I should choose my block dimensions to be a multiple of the warp size, I guess so as not to schedule warps with only a couple of threads in them. You use 16*14*2, which is a multiple of 32. But if I want 2*128*128 threads in my grid, I can either choose a block size of 2*22*22, which gives a per-block thread count that is not a multiple of 32 but a grid of 1*6*6 blocks, or choose a block size of 1*32*32, which launches a 2*4*4 grid! Which solution is theoretically the best? More blocks with incomplete warps? Or fewer blocks with complete warps? – Yugo Amaryl May 24 '13 at 12:59
  • Yes, you can use `ceil` in grid size computation. I just left it out for clarity/simplicity. A 1D array is easier to copy back and forth between host and device. I don't think there's any difference in terms of access speed/time. Yes, block dimensions should be a multiple of warp size. And we also want many adjacent threads in the X dimension and we want X to be our "rapidly varying dimension" in order to enable coalescing, i.e. efficient memory access. So consider re-ordering your data storage to make X the biggest dimension like I did. 32*32*1 (x,y,z) is better than 2*22*22. – Robert Crovella May 24 '13 at 14:02
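A minimal sketch of the configuration that last comment recommends for the 2*128*128 example, assuming the data is stored so that x is the largest, fastest-varying dimension (`mykernel` and `d_data` as in the answer above):

    // Domain reordered to 128 x 128 x 2 so adjacent x-threads touch adjacent memory.
    dim3 block(32, 32, 1);                    // 1024 threads = 32 full warps
    dim3 grid((128 + block.x - 1) / block.x,  // 4
              (128 + block.y - 1) / block.y,  // 4
              (  2 + block.z - 1) / block.z); // 2  -> 32 blocks, no partial warps
    mykernel<<<grid, block>>>(d_data);        // kernel still bounds-checks its indices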