
Are there limitations on what the grid size of a CUDA kernel can be? I ran into a problem where kernels would not launch with a grid size of 33 x 33, but launched fine when the grid size was 32 x 32. Is there any reason for this to occur, or is it likely that changing the number of blocks from 32 x 32 to 33 x 33 broke some other constraint?

dim3 blockSize(8, 8);
dim3 gridSize(32, 32);

cudaDeviceSynchronize();
set_start<<<gridSize, blockSize>>>(some_params);

The above works.

dim3 blockSize(8, 8);
dim3 gridSize(33, 33);

cudaDeviceSynchronize();
set_start<<<gridSize, blockSize>>>(some_params);

The above does not work.
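
For reference, a minimal check (not part of the original code) right after the launch would surface which configuration is rejected; `cudaGetLastError` reports launch-configuration failures such as "invalid configuration argument":

cudaDeviceSynchronize();
set_start<<<gridSize, blockSize>>>(some_params);

// Sketch only: assumes <cstdio> is included for printf.
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    printf("set_start launch failed: %s\n", cudaGetErrorString(launchErr));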

kernel & main:

__global__
void set_start(double * const H, double * const HU, double * const HV,
               double * const E, const int Na)
{
    // +1 so indices start at 1; row/column 0 and Na-1 are never written
    int j = threadIdx.x + blockIdx.x*blockDim.x + 1;
    int i = threadIdx.y + blockIdx.y*blockDim.y + 1;

    if(i >= Na-1 || j >= Na-1)
        return;

    H[i*Na+j]  = 1.0 + exp(-100.0*((E[j-1]-0.75)*(E[j-1]-0.75)+(E[i-1]-0.75)*(E[i-1]-0.75)))
                     + 0.5*exp(-100.0*((E[j-1]-0.75)*(E[j-1]-0.75)+(E[i-1]-0.25)*(E[i-1]-0.25)));
    HU[i*Na+j] = 0;
    HV[i*Na+j] = 0;
}

int main(int argc, char** argv){

    double* E_d;
    cudaMalloc(&E_d, sizeof(double) * (Nh+1));
    set_E<<<64, (Nh/64) + 1>>>(E_d, dx, Nh);

    int Na = 259;
    double *H_d, *HU_d, *HV_d, *Ht_d, *HUt_d, *HVt_d;

    cudaMalloc(&H_d , sizeof(double) * Na * Na);
    cudaMalloc(&HU_d, sizeof(double) * Na * Na);
    cudaMalloc(&HV_d, sizeof(double) * Na * Na);

    dim3 blockSize(8, 8);
    //dim3 gridSize(((Na-1)/blockSize.x) + 1, ((Na-1)/blockSize.y) + 1);
    //dim3 gridSize(33, 33);
    dim3 gridSize(32, 32);

    cudaDeviceSynchronize();
    set_start<<<blockSize, gridSize>>>(H_d, HU_d, HV_d, E_d, Na);
}

This was on CUDA 7.0.

  • Do you by any chance mean block size instead of grid size? 1024 is the maximum number of threads per block since cc 2.0. – void_ptr May 13 '15 at 15:34
  • @void_ptr Likely. It's not uncommon for introductory examples to use a single block (maybe to simplify the `blockIdx/threadIdx...` hassle?), and when people take such examples as "templates", they quickly hit the max-threads-per-block limit when they *actually* want to increase the *grid* size. – Marco13 May 13 '15 at 15:42
  • No I specifically mean grid size. It's a real weird issue I ran into today. – Curtis Bezault May 13 '15 at 15:55
  • Yes, there are limits. Refer to tables 11 and 12 in the programming guide or run the deviceQuery CUDA sample code. If you want to know why a specific code is not working you should provide an MCVE. – Robert Crovella May 13 '15 at 15:56
  • Okay, besides the extremely large maximum number of blocks? – Curtis Bezault May 13 '15 at 16:00
  • No, grid limits are much larger than 32. The code you've added is not an MCVE and does not shed any light on the issue. – Robert Crovella May 13 '15 at 16:24
  • Take a look, I added something more comprehensive. – Curtis Bezault May 13 '15 at 16:40
  • The "`+1`" in the index computations looks suspicious (but I did not analyze the kernel in detail). What exactly is the problem? Does it crash arbitrarily, or return some error code? Did you try running it in `cuda-memcheck`? – Marco13 May 13 '15 at 16:52
  • It's not that it crashes; it's that I get a "kernel parameters invalid" error (or something like that; I don't remember the exact error name and I don't have access to a CUDA-capable machine right now). Essentially it means that the grid size and block size I passed were invalid. – Curtis Bezault May 13 '15 at 17:11
  • In the actual code I check all the returns for errors and whatnot and nothing turns up. – Curtis Bezault May 13 '15 at 17:12
  • I'm not sure (@RobertCrovella will be able to give some more profound advice here), but ... try choosing the grid size as a *multiple* of the block size. (From a quick search, I did not find a definite statement about that, but I think that I once read that this might be beneficial (or even necessary???, @Robert?)) – Marco13 May 13 '15 at 17:42
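
Following the deviceQuery suggestion above, here is a minimal sketch (not part of the original thread) that prints the relevant limits through `cudaGetDeviceProperties`; the `maxThreadsPerBlock`, `maxThreadsDim`, and `maxGridSize` fields are part of the standard `cudaDeviceProp` struct:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max block dims: %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("max grid dims: %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}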

1 Answer


You have block size and grid size mixed up when calling the kernel.

set_start<<<blockSize, gridSize>>>(H_d, HU_d, HV_d, E_d, Na);

should read:

set_start<<<gridSize, blockSize>>>(H_d, HU_d, HV_d, E_d, Na);

Because of this bug you are actually trying to launch a grid of size blockSize and blocks of size gridSize. It would appear that the maximum block size on your GPU is 1024 threads, so launching 33 x 33 blocks (1089 threads) fails, while 32 x 32 (exactly 1024 threads) still fits.
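
For completeness, a sketch of how the corrected call site could look, using the ceiling-division grid size that is commented out in the question (variable names are taken from the question's main; the error check is an addition, not part of the original answer):

dim3 blockSize(8, 8);
// Round up so every interior index 1..Na-2 is covered; for Na = 259 this gives a 33 x 33 grid.
dim3 gridSize(((Na-1)/blockSize.x) + 1, ((Na-1)/blockSize.y) + 1);

set_start<<<gridSize, blockSize>>>(H_d, HU_d, HV_d, E_d, Na);   // grid first, then block

cudaError_t err = cudaGetLastError();   // would have reported the invalid 33 x 33 block
if (err != cudaSuccess)
    printf("set_start launch failed: %s\n", cudaGetErrorString(err));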

– Jez
  • Damn, I feel dumb, thanks for catching that. Those were actually the only two kernel launches in my code where I did that *facepalm*. Even the code I wrote up by hand to demonstrate my problem didn't do that... – Curtis Bezault May 14 '15 at 18:31