How does Nvidia's Fermi GPU issue threadblocks to streaming multiprocessors?

Question

Assume I have 8 threadblocks and my GPU has 8 SMs. How does the GPU issue these threadblocks to the SMs?

Some programs and articles I found suggest a breadth-first scheme, that is, each SM runs one threadblock in this example. However, according to a few documents, increasing occupancy may be a good idea if a GPU kernel is latency-limited. From that, one might infer that the 8 threadblocks will run on 4 or fewer SMs if possible.

I wonder which one is the reality. Thanks in advance.

score 2 · Answer 1 · answered Feb 03 '13 at 15:35


It's hard to tell what the GPU is doing exactly. If you have a specific kernel you're interested in, you could try reading and storing the %smid register for each block.

An example of how to do this is given here.
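As a rough illustration of the idea (a minimal sketch, not necessarily what the linked example does), the kernel below has thread 0 of each block read the `%smid` special register through inline PTX and store it in an array indexed by block. The names `read_smid`, `record_smid`, and `block_sm_ids` are made up for this sketch. Note that the PTX documentation describes `%smid` as volatile, so treat the result as an observation of one particular run rather than a guarantee about scheduling.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Read the %smid special register: the ID of the SM executing this thread.
__device__ unsigned int read_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Thread 0 of every block records which SM the block was placed on.
__global__ void record_smid(unsigned int *block_to_sm) {
    if (threadIdx.x == 0) {
        block_to_sm[blockIdx.x] = read_smid();
    }
}

// Hypothetical host helper: launches num_blocks blocks and returns the
// SM ID observed by each block. Returns an empty vector on any CUDA error.
std::vector<unsigned int> block_sm_ids(int num_blocks, int threads_per_block) {
    if (num_blocks <= 0 || threads_per_block <= 0) {
        return std::vector<unsigned int>();
    }
    std::vector<unsigned int> host(num_blocks);
    size_t bytes = host.size() * sizeof(unsigned int);

    unsigned int *dev = NULL;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        return std::vector<unsigned int>();
    }

    record_smid<<<num_blocks, threads_per_block>>>(dev);
    cudaError_t err = cudaGetLastError();      // catches launch failures
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);

    if (err != cudaSuccess) {
        return std::vector<unsigned int>();
    }
    return host;
}
```

Calling `block_sm_ids(8, 32)` from host code and printing the eight entries shows, for that run, whether the blocks were spread over eight distinct SMs or packed onto fewer. A trivial kernel like this may not behave the same way as a kernel with heavy register or shared-memory use, so it is worth repeating the measurement with the kernel you actually care about.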


Pedro


Ok.. So Nvidia doesn't release information about this. Maybe I will try this experiment. Thanks! – Antony Yu Feb 06 '13 at 08:12

Nikolaos Giotis · Answer 2 · 2013-10-13T17:11:45.217

You're asking the wrong question: you shouldn't worry about how the hardware allocates thread-blocks to SMs. That's the GPU's responsibility. In fact, because the programming model makes no assumptions about which blocks will run on which SMs, you get scalability across a pool of computing devices and future generations.

Instead, you should try to feed the GPU the optimal number of thread-blocks. That's non-trivial, since it's subject to many restrictions.
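One way to get a handle on those restrictions (per-SM limits on registers, shared memory, and resident threads, among others) is to ask the runtime how many blocks of a given kernel can be resident on one SM. The sketch below assumes a CUDA toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which was added after this answer was written. The kernel `scale_kernel` and the helper `max_resident_blocks` are placeholders invented for illustration.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); substitute the kernel you actually launch.
__global__ void scale_kernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical helper: returns how many blocks of block_size threads of
// scale_kernel can be resident on the current device at the same time,
// or -1 if any CUDA call fails.
int max_resident_blocks(int block_size) {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        return -1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }

    // Blocks per SM, given the kernel's register and shared-memory usage.
    // The last argument is the dynamic shared memory per block, in bytes.
    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, scale_kernel, block_size, 0) != cudaSuccess) {
        return -1;
    }

    return blocks_per_sm * prop.multiProcessorCount;
}
```

A grid with at least `max_resident_blocks(block_size)` blocks gives every SM as many resident blocks as this kernel allows. Whether that number is actually optimal still depends on the kernel, so it is a starting point for tuning rather than an answer.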
