I am working on something that has highlighted the fact that I don't have a firm grasp of how blocks and grids work in CUDA. I have a 1000x10 matrix, and I would like to traverse it and fill in each element with a value. The kernel looks like this:
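__global__ void myfun(float *vals, float *out, int M, int N)
{
    // Global 2D coordinates of this thread within the whole grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Row-major linear index into the M x N matrix
    int index = row * N + col;
    // Guard against threads that fall outside the matrix
    if ((row < M) && (col < N)) {
        out[index] = index;
    }
}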
where M = 1000 and N = 10. I don't know how to slice this up so that every element in the matrix is covered. I need coverage for 1000*10 = 10,000 elements, and given the limit on the number of threads per block, I can't use a block size of (10,1000,1). Using pycuda, I've tried things like block = (10,100,1), grid = (1,10), but I never get full coverage of the matrix elements. What's the right way to do this?
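To make the coverage requirement concrete, here is a minimal host-side sketch in plain Python (no GPU needed) that replays the kernel's index arithmetic for a given block and grid and reports which elements would be written. The helper names `grid_for` and `covered_indices` and the 16x16 block size are illustrative assumptions, not part of pycuda or of my actual code.

```python
def grid_for(M, N, block):
    """Smallest (x, y) grid such that grid * block spans N columns and M rows."""
    bx, by, _ = block
    # Integer ceiling division: round up so partial tiles are still covered
    return ((N + bx - 1) // bx, (M + by - 1) // by)


def covered_indices(M, N, block, grid):
    """Replay the kernel's index arithmetic and return the set of written indices."""
    bx, by, _ = block
    gx, gy = grid
    written = set()
    for block_y in range(gy):          # plays the role of blockIdx.y
        for block_x in range(gx):      # plays the role of blockIdx.x
            for ty in range(by):       # plays the role of threadIdx.y
                for tx in range(bx):   # plays the role of threadIdx.x
                    row = block_y * by + ty
                    col = block_x * bx + tx
                    if row < M and col < N:   # same bounds check as the kernel
                        written.add(row * N + col)
    return written


M, N = 1000, 10
block = (16, 16, 1)            # assumed block size: 256 threads per block
grid = grid_for(M, N, block)   # x covers the N columns, y covers the M rows
full = len(covered_indices(M, N, block, grid)) == M * N
print(grid, full)
```

This only checks the index arithmetic on the host. It does not model hardware limits such as the maximum number of threads per block, so a configuration that passes here can still fail to launch on a particular device.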