
This question follows from this link: shared memory optimization confusion

In talonmies's answer at that link, I found that the first condition on the number of blocks that will be scheduled to run is "8". I have 3 questions, shown below.

  1. Does this mean that only 8 blocks can be scheduled at the same time when the number of blocks allowed by conditions 2 and 3 is over 8? Is that true regardless of anything else, such as the CUDA environment, the GPU device, or the algorithm?

  2. If so, it really means that in some cases it is better not to use shared memory; it depends. Then we have to think about how to judge which is better, using or not using shared memory. I think one approach is to check whether global memory access is the limitation (a memory bandwidth bottleneck) or not. That is, we can choose "not using shared memory" if there is no global memory access limitation. Is that a good approach?

  3. In addition to question 2, I think that if the data my CUDA program has to handle is huge, then "not using shared memory" may be better, because it is hard to handle such data within shared memory. Is that also a good approach?

user1292251
  • You did note that the 8 blocks figure is the *maximum number of concurrent blocks per MP* – talonmies Apr 04 '12 at 12:44
  • Maybe I missed something. I thought the maximum number of concurrent blocks per MP was always 8 (or in the majority of cases). Then what is the maximum number of concurrent blocks per MP when there is no limitation on the algorithm (program) side? Is it the number of cores in the MP? – user1292251 Apr 05 '12 at 04:58
  • No, it is just 8 concurrent blocks per MP (at least on compute 1.x and 2.x hardware). You can schedule 65535 blocks in each grid dimension the hardware supports, but only up to 8 run concurrently on each MP on the GPU. – talonmies Apr 05 '12 at 06:15

1 Answer


The number of concurrently scheduled blocks is always going to be limited by something.

Playing with the CUDA Occupancy Calculator should make it clear how this works. The usage of three types of resources affects the number of concurrently scheduled blocks: Threads Per Block, Registers Per Thread, and Shared Memory Per Block.

If you set up a kernel that uses 1 Thread Per Block, 1 Register Per Thread and 1 byte of Shared Memory Per Block on Compute Capability 2.0, you are limited by Max Blocks per Multiprocessor, which is 8. If you start increasing Shared Memory Per Block, Max Blocks per Multiprocessor will continue to be your limiting factor until you reach a threshold at which Shared Memory Per Block becomes the limiting factor. Since there are 49152 bytes of shared memory per SM, that happens at around 49152 / 8 = 6144 bytes per block (it's a bit less because some shared memory is used by the system, and it's allocated in chunks of 128 bytes).
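If you prefer to check this in code rather than in the spreadsheet, the sketch below queries the runtime for the number of resident blocks per multiprocessor, once with no dynamic shared memory and once with 8192 bytes per block. It is only a sketch: the kernel `dummy_kernel` is a made-up placeholder, and it assumes the occupancy API (`cudaOccupancyMaxActiveBlocksPerMultiprocessor`), which shipped in CUDA toolkits newer than the one current when this question was asked.

```cpp
// Sketch: ask the runtime how many blocks of a kernel can be resident on
// one SM, first with no dynamic shared memory, then with 8192 bytes per
// block. The kernel is a placeholder used only for the occupancy query.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *out)
{
    extern __shared__ float tile[];          // dynamic shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = static_cast<float>(i);
    __syncthreads();
    out[i] = tile[threadIdx.x];
}

int main()
{
    const int threadsPerBlock = 128;

    int blocksNoSmem = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksNoSmem, dummy_kernel, threadsPerBlock, 0 /* bytes of smem */);

    int blocksBigSmem = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksBigSmem, dummy_kernel, threadsPerBlock, 8192 /* bytes of smem */);

    printf("resident blocks per MP, 0 B smem:    %d\n", blocksNoSmem);
    printf("resident blocks per MP, 8192 B smem: %d\n", blocksBigSmem);
    return 0;
}
```

On a Compute Capability 2.0 device the first number should come back as 8 (the Max Blocks per Multiprocessor cap), and the second should drop to at most 49152 / 8192 = 6, since shared memory is now the limiting factor; on newer architectures both limits are higher, so the printed values will differ.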

In other words, given the limit of 8 Max Blocks per Multiprocessor, using shared memory is completely free (as it relates to the number of concurrently running blocks), as long as you stay below the threshold at which Shared Memory Per Block becomes the limiting factor.

The same goes for register usage.
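For the register side, a quick way to see where you stand is to compile with `nvcc --ptxas-options=-v`, which prints the registers per thread to plug into the calculator. If registers become the limit, `__launch_bounds__` lets you ask the compiler to stay under it; the bounds in this sketch (256 threads, 4 blocks per MP) are illustrative assumptions, not recommended values.

```cpp
// Sketch: tell the compiler the intended launch configuration so it limits
// registers per thread accordingly. The numbers are illustrative only.
// Build with: nvcc -arch=sm_20 --ptxas-options=-v kernel.cu
__global__ void
__launch_bounds__(256 /* maxThreadsPerBlock */, 4 /* minBlocksPerMultiprocessor */)
saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```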

Roger Dahl
  • Thanks for your answer. BTW, where does the 128 bytes come from, in line 7, paragraph 3? Is it some kind of meaningful storage size? – user1292251 Apr 05 '12 at 04:27
  • I found the answer. It's the "Shared Memory Allocation Unit Size". Is that what you meant? – user1292251 Apr 05 '12 at 05:41
  • Yes. Ideally, `Shared Memory Allocation Unit Size` would be 1 byte. The number 128 represents a tradeoff that NVIDIA have made between that ideal and the goal of maximizing overall performance for the chip. One can speculate that the specific gain was that they did not have to include the 6 lower address lines and corresponding logic that would be required for addressing at 1-byte granularity below 128. – Roger Dahl Apr 05 '12 at 14:01