
Assume I have 8 thread blocks and my GPU has 8 SMs. How does the GPU issue these thread blocks to the SMs?

I have found some programs and articles suggesting a breadth-first manner, i.e. each SM runs one thread block in this example. However, according to a few documents, increasing occupancy may be a good idea when GPU kernels are latency-limited. That would seem to imply that the 8 thread blocks could run on 4 or fewer SMs if possible.

I wonder which one matches reality. Thanks in advance.

Antony Yu

2 Answers


It's hard to tell what the GPU is doing exactly. If you have a specific kernel you're interested in, you could try reading and storing the %smid register for each block.

An example of how to do this is given here.
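For reference, a minimal sketch of that approach. The kernel name, grid size, and launch configuration are illustrative; the `%smid` special register is read through inline PTX:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the ID of the SM that the current thread is running on.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Each block records which SM it was scheduled on.
__global__ void record_smid(unsigned int *smids) {
    if (threadIdx.x == 0)
        smids[blockIdx.x] = get_smid();
}

int main() {
    const int nblocks = 8;
    unsigned int *d_smids, h_smids[nblocks];
    cudaMalloc(&d_smids, nblocks * sizeof(unsigned int));

    record_smid<<<nblocks, 128>>>(d_smids);
    cudaMemcpy(h_smids, d_smids, nblocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    for (int b = 0; b < nblocks; ++b)
        printf("block %d ran on SM %u\n", b, h_smids[b]);

    cudaFree(d_smids);
    return 0;
}
```

Note that `%smid` only tells you where a block happened to run on that launch; the mapping is not guaranteed to be the same across runs or devices.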

Pedro

You are asking the wrong question: you shouldn't worry about how the hardware allocates thread blocks to SMs. That's the GPU's responsibility. In fact, because the programming model makes no assumptions about which blocks will run on which SMs, you get scalability across a pool of computing devices and future hardware generations.

Instead, you should try to feed the GPU the optimal number of thread blocks. That's non-trivial, since it's subject to many restrictions.
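As one way to approach that, the CUDA runtime can estimate how many blocks of a given kernel fit on one SM, given its register and shared-memory usage. A sketch, assuming a kernel named `my_kernel` and an illustrative block size of 256:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel() { /* placeholder kernel body */ }

int main() {
    int device = 0, num_sms = 0, blocks_per_sm = 0;
    const int block_size = 256;  // threads per block, illustrative

    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

    // Ask the runtime how many resident blocks of my_kernel fit on one SM,
    // given the block size and zero dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, block_size, /*dynamicSmemSize=*/0);

    // Launching at least this many blocks gives every SM work to do.
    printf("blocks per SM: %d, SMs: %d, blocks to fill GPU: %d\n",
           blocks_per_sm, num_sms, blocks_per_sm * num_sms);
    return 0;
}
```

This only gives a lower bound for saturating the device; launching several times that many blocks is often better, so the scheduler has spare blocks to hide latency with.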

Nikolaos Giotis