Assume I have 8 threadblocks and my GPU has 8 SMs. Then how does GPU issue this threadblocks to the SMs?
I found some programs or articles suggest a breadth-first manner, that is , each SM runs a threadblock in this example. However, according to a few documents, increasing occupancy may be a good idea if GPU kernels are latency-limited. It might be inferred that 8 threadblocks will run on 4 or less SMs if it can.
I wonder which one is the reality. Thanks in advance.