0

For a constant block size of 128 (cores per MP):

I did a performance comparison of a grid having a dim3 2D array with dimensions dim3(WIDTH, HEIGHT) versus a flattened 1D array of int = WIDTH * HEIGHT, where WIDTH, HEIGHT can be any arbitrarily large values representing a 2D array/matrix so long as "int" doesn't overflow in C.

According to my research, such as this answer here: Maximum blocks per grid:CUDA only 65535 blocks should be supported in a single dimension.

Yet with WIDTH = 4000, HEIGHT = 4000, the speed results end up essentially the same over multiple trials regardless of whether the grid has 1 dimension or 2. Eg: Given gridDim { x = 125000, y = 1, z = 1 } I get the same performance as gridDim { x = 375, y = 375, z = 1 }, with a block size of 128 (computationally expensive operations are performed on the array for each thread).

  1. I thought for the 1D gridDim, any value over 65535 shouldn't even work, going by prior answers. Why is such a large dimension accepted then?
  2. Even if it does work, I thought this should somehow lead to wasted cores. Yet the speed between dim3 and a flattened 1D grid, with threads per block of 128 (# of cores per MP), is the same from my tests. What's the point then of using dim3 with multiple dimensions instead of a single dimension for the grid size?

Could someone please enlighten me here?

Thank you.

talonmies
  • 70,661
  • 34
  • 192
  • 269
user4779
  • 645
  • 5
  • 14

1 Answers1

2

As can be seen in Table 15. Technical Specifications per Compute Capability, the x-dimension is not restricted to 65535 like the other two dimensions, instead it can go up to 2^31 - 1 for all supported compute architectures. As to why this is the case, you might not get a good answer as this seems like an old implementation detail.

The information in the linked SO answer is outdated (as mentioned in the comments). I edited it for future readers.

The dimensionality of the grid does not matter for "wasting cores". The amount of threads per block (together with the use of shared memory and registers) is what is important for utilization. And even there the dimensionality is just to make the code easier to write and read, as many GPU use-cases are not one-dimensional.

The amount of blocks in a grid (together with the amount of blocks that can be fitted onto each SM) can matter for minimizing the tail effect in smaller kernels (see this blog post), but again the dimensionality should be of no concern for that.

I have never seen any information about the dimensionality of the grid or blocks mattering directly to performance in a way that could not be emulated using 1D grids and blocks (i.e. 2D tiles for e.g. matrix multiplication are important for performance, but one could emulate them and use 1D blocks instead), so I view them just as a handy abstraction to keep index computations in user code at a reasonable level.

paleonix
  • 2,293
  • 1
  • 13
  • 29
  • 1
    Thanks this has given me the insights I needed. So essentially I need not overly concern myself with grid dimensions in terms of core utilization, but rather only in terms of how it might be easier to reason about the blocks on a more abstract level. - that is my understanding from this – user4779 Sep 05 '22 at 14:10