
I have an Nvidia GeForce GTX 960M graphics card, which has the following specs:

  • Multiprocessors: 5
  • Cores per multiprocessor: 128 (i.e. 5 x 128 = 640 cores in total)
  • Max threads per multiprocessor: 2048
  • Max block size (x, y, z): (1024, 1024, 64)
  • Warp size: 32
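
For reference, most of these values can be read at runtime with cudaGetDeviceProperties; a minimal sketch (device index 0 assumed; the cores-per-SM count is not exposed by the API and has to be looked up per compute capability, 128 for this Maxwell part):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        printf("Multiprocessors:                %d\n", prop.multiProcessorCount);
        printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max block size (x, y, z):       (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Warp size:                      %d\n", prop.warpSize);
        return 0;
    }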

If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will only run 128 threads concurrently. However, if I run 5 blocks of 128 threads, then each multiprocessor gets a block and all 640 threads run concurrently. So, as long as I create blocks of 128 threads, the distribution of threads across multiprocessors can be as even as possible (assuming at least 640 threads in total).
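
In launch-configuration terms, the two cases being compared look like this (the kernel and the device pointer are placeholders; d_data is assumed to be a device allocation of at least 640 floats):

    // Illustrative kernel only; the launch configuration is the point of interest.
    __global__ void work(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    // 1 block of 640 threads: the entire block is assigned to a single SM.
    work<<<1, 640>>>(d_data);

    // 5 blocks of 128 threads: the blocks may be distributed across the 5 SMs.
    work<<<5, 128>>>(d_data);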

My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?

Numaerius

1 Answer


If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will only run 128 threads concurrently.

That isn't correct. All 640 threads run concurrently. The SM has instruction latency and is pipelined, so all 640 threads can be active and have state on the SM simultaneously. Threads are not tied to a specific core, and the execution model is very different from a conventional multi-threaded CPU model.

However, if I run 5 blocks of 128 threads, then each multiprocessor gets a block and all 640 threads run concurrently.

That may happen, but it is not guaranteed. All blocks will run; which SM they run on is determined by the block scheduling mechanism, and those heuristics are not documented.
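
One way to observe this is to have each block record the SM it actually ran on by reading the %smid special register via inline PTX; a diagnostic sketch (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ unsigned int smid() {
        unsigned int id;
        asm("mov.u32 %0, %%smid;" : "=r"(id));  // PTX special register: id of the SM executing this thread
        return id;
    }

    __global__ void whereDidIRun(unsigned int *sm_of_block) {
        if (threadIdx.x == 0)
            sm_of_block[blockIdx.x] = smid();   // record one SM id per block
    }

    int main() {
        const int blocks = 5;
        unsigned int *d_ids, h_ids[blocks];
        cudaMalloc(&d_ids, blocks * sizeof(unsigned int));
        whereDidIRun<<<blocks, 128>>>(d_ids);
        cudaMemcpy(h_ids, d_ids, sizeof(h_ids), cudaMemcpyDeviceToHost);
        for (int b = 0; b < blocks; ++b)
            printf("block %d ran on SM %u\n", b, h_ids[b]);
        cudaFree(d_ids);
        return 0;
    }

On a 5-SM part the blocks often do land on distinct SMs, but nothing in the programming model promises that mapping.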

So, as long as I create blocks of 128 threads, the distribution of threads across multiprocessors can be as even as possible (assuming at least 640 threads in total).

From the points above, that does not follow either.

My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?

Because threads are not tied to cores. The architecture has a lot of latency, and it requires a significant number of threads in flight to hide that latency and reach peak performance. Unfortunately, basically none of the premises in your question are correct or relevant to determining the optimal number of blocks or their size for a given device.
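
As a practical starting point, the runtime's occupancy API can suggest a block size that keeps enough threads in flight for a particular kernel; a minimal sketch using cudaOccupancyMaxPotentialBlockSize (the kernel is a placeholder, and, as the comments below note, occupancy is only a crude metric, not a performance guarantee):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        int minGridSize = 0, blockSize = 0;
        // Suggested block size that maximizes theoretical occupancy for this kernel
        // on the current device, and the minimum grid size needed to reach it.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, work, 0, 0);
        printf("suggested block size: %d, minimum grid size: %d\n", blockSize, minGridSize);
        return 0;
    }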

talonmies
    Why? Occupancy is completely irrelevant to the question. It has nothing to do with the execution model. It is (at best) a crude performance metric. – talonmies Jul 23 '19 at 16:22
  • It is at least a one-way correlation insofar as low occupancy does hinder the ability to hide latencies. But admittedly, the main thing that I know here is that I don't know enough. When you say that occupancy is not relevant, I'll delete the comment. – Marco13 Jul 23 '19 at 16:33
  • I suggest you check Vasily Volkov's work, especially the part where he shows a simple GEMM kernel running at about 90% of peak flops at 12% occupancy by exploiting instruction-level parallelism. It is a *very* crude metric at best. – talonmies Jul 23 '19 at 17:36