Based on my current understanding of CUDA and GPU computing, each thread block has to run entirely within one Streaming Multiprocessor (SM). Each SM has a fixed number of cores, which I assume equals the maximum number of active threads on that SM. My question is: what happens if a thread block contains more threads than the number of cores on one SM?
For example, the GeForce GTX 980 has 16 streaming multiprocessors, each with 128 cores, so by my reasoning the maximum number of active threads on one SM would be 128. But in CUDA we can launch a kernel with <<<1,512>>>, i.e. one thread block of 512 threads. How will the GPU execute this kernel? Will it run all 512 threads on one SM, or will it split them across multiple SMs? If it runs on one SM, how does that work? Does it run the first 128 threads, then the next 128, and so on?
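To make the scenario concrete, here is a minimal sketch of the kind of launch I mean (the kernel name and buffer are just placeholders I made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread writes its own index into the output buffer.
__global__ void writeIdx(int *out) {
    int i = threadIdx.x;   // 0..511 within the single block
    out[i] = i;
}

int main() {
    const int N = 512;
    int *d_out, h_out[N];
    cudaMalloc(&d_out, N * sizeof(int));
    // One block of 512 threads -- more threads than the 128 cores
    // available on a single GTX 980 SM.
    writeIdx<<<1, N>>>(d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[511] = %d\n", h_out[511]);
    cudaFree(d_out);
    return 0;
}
```

All 512 threads complete correctly on real hardware, which is exactly what I am trying to understand given the 128-core limit per SM.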
Please correct me if any part of my understanding is incorrect. Thanks.