
Why does performance improve when I run more than 32 threads per block?

My graphics card has 480 CUDA cores (15 SMs * 32 SPs).

user1885750

3 Answers


Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or if the execution unit for the next instruction is busy. On each cycle, each warp scheduler picks a warp from the list of eligible warps and issues 1 or 2 instructions.

The more active warps per SM, the larger the pool of warps each warp scheduler has to pick from on each cycle. In most cases, optimal performance is achieved when there are sufficient active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease it.

A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps to maximum warps supported by a launch configuration is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX 480 (a CC 2.0 device), a good starting point when designing a kernel is 50-66% Theoretical Occupancy. A CC 2.0 SM can have a maximum of 48 warps, so 50% occupancy means 24 warps, or 768 threads per SM.
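
As a rough illustration (assuming a CUDA version new enough to have the occupancy API, which post-dates this Q&A, and using a placeholder kernel and block size), you can check how a launch configuration maps to occupancy like this:

    // Minimal sketch; myKernel and blockSize are placeholders.
    #include <cstdio>

    __global__ void myKernel(float *data) { /* placeholder body */ }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256;      // 8 warps per block
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, 0 /* dyn. smem */);

        // Theoretical occupancy = resident threads / max threads per SM
        float occupancy = (blocksPerSM * blockSize) /
                          (float)prop.maxThreadsPerMultiProcessor;
        printf("Theoretical occupancy: %.0f%% (%d blocks of %d threads per SM)\n",
               occupancy * 100.0f, blocksPerSM, blockSize);
        return 0;
    }

On a CC 2.0 device, a 256-thread block that ends up with 3 resident blocks per SM works out to 768 threads, i.e. the 50% figure above.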

The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.

The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.

NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.

Greg Smith
  • Thank you, Greg. Very useful info. Why is 50-66% a good target? – Roger Dahl Dec 08 '12 at 00:15
  • 1
    @RogerDahl: In order to reach a high occupancy your kernel must use very few registers (and not much local memory). This means that basically only the simplests of kernels can hit 100%, which makes aiming for that unpractical in most situations. Furthermore the performance benefits of reaching 100% instead of 50% aren't really big (most latencies will be hidden even by 50% occupancy). So aiming for 50% allows you to actually do stuff, while still getting most of the performance – Grizzly Dec 10 '12 at 20:52
  • 1
    As @Grizzly replied occupancy is a tradeoff between warps and other resources. 50-66% tends to be where you have sufficient active warps such that at least 1 warp is eligible per scheduler per cycle. If the kernel is math intensive and has a lot of ILP occupancy can be decreased and still cover latency. If the kernel is memory bound then occupancy generally needs to be increased. – Greg Smith Dec 10 '12 at 21:15

Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
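
A rough sketch with made-up kernels to show what that means in practice. With a ~20-cycle arithmetic pipeline and 32 cores per SM, that works out to roughly 640 threads (20 warps) per SM just to cover arithmetic latency when every instruction depends on the previous one; independent work within a thread (the ILP Greg mentions in his comment above) reduces that requirement:

    // Made-up kernels, just to illustrate the pipeline/ILP point.
    __global__ void dependent_chain(float *out, int iters)
    {
        // Every multiply-add depends on the previous result, so one warp exposes
        // the full arithmetic pipeline latency between its own instructions; the
        // scheduler needs other resident warps to issue from while this one waits.
        float a = threadIdx.x;
        for (int i = 0; i < iters; ++i)
            a = a * 1.0001f + 0.5f;
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;
    }

    __global__ void two_chains(float *out, int iters)
    {
        // Two independent chains per thread (ILP): the second multiply-add can
        // issue while the first is still in the pipeline, so fewer resident
        // warps are needed to keep the cores busy.
        float a = threadIdx.x, b = threadIdx.x + 1.0f;
        for (int i = 0; i < iters / 2; ++i) {
            a = a * 1.0001f + 0.5f;
            b = b * 1.0001f + 0.5f;
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
    }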

Roger Dahl
  • To add to this, having more threads than cores in flight on an SM unit allows the GPU to hide memory access latency. If you have exactly as many threads running as you have cores, then when a thread accesses global memory it has to wait several hundred clock cycles before the data is actually received. In the meantime, if there are no extra threads, that SM unit will be idle. If there are extra threads, the SM unit will pause the threads waiting on the memory access and switch to another set of threads with more work to do. This effectively hides the expense of accessing global memory. – Brendan Wood Dec 07 '12 at 18:29
  • Brendan's answer is correct, but it really doesn't have much to do with the pipeline depth. It's the latency of the memory accesses that the extra threads per core are hiding, not the pipe depth. You can run a pipelined CPU core (even a CUDA core) at 100% efficiency with only 1 thread per core if your code doesn't need to access memory. Where pipe depth matters is in the execution latency of instructions that require the pipe to be flushed (such as branches), but having multiple threads per core doesn't necessarily help with that. – reirab Jun 12 '13 at 21:15
  • @reirab: Remember that there is a latency in the pipeline that corresponds to the pipeline depth. With a deep pipeline, it becomes hard to find enough instruction level parallelism (ILP) to saturate a core. So, without thread level parallelism (TLP), you get stalls because all remaining instructions require, as input, something that is still in the pipeline. – Roger Dahl Jun 13 '13 at 04:22

The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide the latency to main memory. This results in a large percentage of chip resources being devoted to cache; most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses to return, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the more chance one of them will be runnable at any given time.
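
A made-up kernel to make this concrete: every thread issues a global load that takes a few hundred cycles to return, and the only thing that keeps the SM busy during that wait is having other resident warps to switch to:

    // Made-up memory-bound kernel. Each thread does a global load that takes a
    // few hundred cycles to come back; with only one warp per scheduler the SM
    // would sit idle during that wait, while with many resident warps the
    // scheduler keeps switching to warps whose data has already arrived.
    __global__ void scale(const float *in, float *out, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * s;   // load -> long wait for data -> multiply -> store
    }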

CUDA also has zero-cost thread switching in order to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead to switch from executing one thread to the next, due to the need to store all of the register values for the thread it is switching away from onto the stack and then load all of the ones for the thread it is switching to. CUDA SMs instead have a very large register file, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, an SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.
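
If you want to see the register side of this, here is a small sketch (placeholder kernel name) that queries how many registers the compiler assigned per thread; since each resident thread holds its physical registers for its whole lifetime, that count times the number of resident threads has to fit in the register budget, which is one of the things that caps warps per SM:

    // Minimal sketch with a placeholder kernel: ask the runtime how many
    // registers the compiler assigned to each thread, and what the per-block
    // register budget is on this device.
    #include <cstdio>

    __global__ void myKernel(float *data) { /* placeholder body */ }

    int main()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, myKernel);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Registers per thread:          %d\n", attr.numRegs);
        printf("Registers available per block: %d\n", prop.regsPerBlock);
        return 0;
    }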

reirab