
I am working on something that has shown me I don't have a firm grasp of how blocks and grids work in CUDA. I have a 1000x10 matrix that I would like to traverse, filling in each element with a value. The kernel looks like this:

__global__ void myfun(float *vals, float *out, int M, int N)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int index = row*N + col;

    if( (row < M) && (col < N) ) {
        out[index] = index;
    }
}

where M = 1000 and N = 10. I don't know how to slice this up so that every element of the matrix is covered. I need coverage for 1000*10 = 10,000 elements, and given the limit on the number of threads per block, I can't use a block size of (10,1000,1). Using pycuda, I've tried things like block = (10,100,1), grid = (1,10), but I never get full coverage of the matrix elements. What's the right way to do this?

user1489497

1 Answer


Fix the block size, and keep the grid size dynamic. In this way, the kernel will cover each element of the matrix no matter what the values of M and N are.

block = (8,8)
grid = ((N + 7) / 8, (M + 7) / 8)

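Launch the kernel with this grid and block configuration. Each division is an integer division, so (N + 7) / 8 rounds N / 8 up to the number of 8-wide blocks needed; for M = 1000 and N = 10 this gives grid = (2, 125). In pycuda, use the // operator and pass the block as the 3-tuple (8, 8, 1). You may change the block size if desired, as long as it stays within the device's limits.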
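
To see why a rounded-up grid reaches every element, here is a minimal CPU-only sketch in Python that emulates the kernel's index arithmetic. It does not touch the GPU or pycuda; the helper names `grid_for` and `covered_indices` are made up for this illustration. The resulting tuples are the kind of values you would pass as the `block` and `grid` arguments when launching from pycuda.

```python
def grid_for(M, N, block=(8, 8, 1)):
    """Hypothetical helper: smallest 2-D grid covering an M x N (rows x cols) matrix."""
    bx, by, _ = block
    # Ceiling division: (n + b - 1) // b rounds up without floating point.
    return ((N + bx - 1) // bx, (M + by - 1) // by)


def covered_indices(M, N, block, grid):
    """Hypothetical helper: replay the kernel's row/col/index arithmetic on the
    CPU and collect every linear index that passes the bounds check."""
    bx, by, _ = block
    gx, gy = grid
    hit = set()
    for block_y in range(gy):              # plays the role of blockIdx.y
        for block_x in range(gx):          # plays the role of blockIdx.x
            for thread_y in range(by):     # plays the role of threadIdx.y
                for thread_x in range(bx): # plays the role of threadIdx.x
                    row = block_y * by + thread_y
                    col = block_x * bx + thread_x
                    if row < M and col < N:
                        hit.add(row * N + col)
    return hit


M, N = 1000, 10
block = (8, 8, 1)
grid = grid_for(M, N, block)
print(grid)                                                # (2, 125)
print(len(covered_indices(M, N, block, grid)) == M * N)    # True
```

The grid launches 2 * 125 * 64 = 16,000 threads for 10,000 elements; the surplus threads fall outside the matrix and are filtered out by the `row < M && col < N` check in the kernel, which is why that check must stay.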

sgarizvi
  • I know that the `block(X,Y)` dimensions should be a multiple of the magic number 32. But what do these `7`s and `8`s mean? Could it be that this is `grid = ((N + (X-1)) / X, (M + (Y-1)) / Y)`, and if so, why? What's the meaning of it? – user3085931 Apr 23 '16 at 09:25
  • @user3085931 Yes, exactly; this is the generic formula for any block size. I provided the calculation specific to a block size of `8 x 8`. – sgarizvi Apr 23 '16 at 15:02