I have an NVIDIA GeForce GTX 960M graphics card, which has the following specs:
- Multiprocessors: 5
- Cores per multiprocessor: 128 (i.e. 5 x 128 = 640 cores in total)
- Max threads per multiprocessor: 2048
- Max block size (x, y, z): (1024, 1024, 64)
- Warp size: 32
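
(For reference, most of these numbers can be read at runtime from `cudaGetDeviceProperties`; cores per multiprocessor is the exception, since it is implied by the compute capability rather than reported directly. A minimal sketch:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("Multiprocessors:        %d\n", prop.multiProcessorCount);
    std::printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Max block size (x,y,z): (%d, %d, %d)\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("Warp size:              %d\n", prop.warpSize);
    // Cores per SM are not in cudaDeviceProp; for this Maxwell part
    // (compute capability 5.0) the figure is 128 per SM.
    return 0;
}
```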
If I run 1 block of 640 threads, then a single multiprocessor gets the whole workload of 640 threads but can only execute 128 of them at a time. If instead I run 5 blocks of 128 threads, each multiprocessor gets one block and all 640 threads run concurrently. So, as long as I create blocks of 128 threads, the distribution of threads across multiprocessors can be as even as possible (assuming there are at least 640 threads in total). The two launch configurations I mean are sketched below.
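
(Concretely, here is what I am comparing. `touch` is a hypothetical kernel just to show the launch syntax, and I am assuming the scheduler spreads the five blocks across the five multiprocessors:)

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread increments one element of `data`.
__global__ void touch(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 640;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    // Configuration A: one block of 640 threads -> lands on a single SM.
    touch<<<1, 640>>>(data, n);
    cudaDeviceSynchronize();

    // Configuration B: five blocks of 128 threads -> one block per SM,
    // assuming the scheduler distributes them evenly.
    touch<<<5, 128>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```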
My question then is: why would I ever want to create blocks larger than the number of cores per multiprocessor (as long as I'm not hitting the maximum number of blocks per grid dimension)?