
Based on my current understanding of CUDA and GPU computing, each thread block has to run on a single Streaming Multiprocessor. Each streaming multiprocessor has a fixed number of cores (which, as I understand it, equals the maximum number of active threads on that streaming multiprocessor). My question is: what will happen if we have a thread block with more threads than the number of cores on one streaming multiprocessor?

For example, the GeForce GTX 980 has 16 streaming multiprocessors, each of which has 128 cores. So the maximum number of active threads on one streaming multiprocessor would be 128. But when programming in CUDA, we can launch a kernel with <<<1,512>>>, which means one thread block with 512 threads. How will the GPU execute this kernel? Will it run all 512 threads on one streaming multiprocessor, or will it split them across multiple streaming multiprocessors? If it runs on one streaming multiprocessor, how can that happen? Will it run 128 threads first, then the next 128 threads, and so on?

Please correct me if any of my understanding is not correct. Thanks.

Negelis
  • This question has been asked many times before here on [SO]. I think the linked duplicate answers all your questions. – talonmies Nov 05 '15 at 16:01
  • 1
    I would also suggest reading the relevant section of [the programming guide](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture) and perhaps any of the available whitepapers, such as [this one](https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf) pp.6-7. The unit of GPU execution scheduling is the **warp** (32 threads) not the entire threadblock. All currently supported GPUs have at least 32 cores per SM. – Robert Crovella Nov 05 '15 at 16:04

0 Answers