I've compared a simple 3D cuFFT program on both a GTX 780 and a Tesla K40 in double precision mode.
On the GTX 780 I measured about 85 Gflops, while on the K40 I measured about 160 Gflops. These results baffled me: the GTX 780 has a theoretical double-precision peak of 166 Gflops, while the K40's is 1.4 Tflops.
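For reference, here is a minimal sketch of the kind of benchmark I ran (the 256^3 transform size and the standard 5*N*log2(N) flop-count estimate are assumptions; my actual program may differ in these details):

```cpp
#include <cstdio>
#include <cmath>
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int NX = 256, NY = 256, NZ = 256;       // assumed transform size
    const size_t N = (size_t)NX * NY * NZ;

    cufftDoubleComplex *data;
    cudaMalloc(&data, N * sizeof(cufftDoubleComplex));
    cudaMemset(data, 0, N * sizeof(cufftDoubleComplex));

    cufftHandle plan;
    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_Z2Z);    // double-precision complex-to-complex

    // Warm-up execution so plan setup and first-launch overhead are not timed.
    cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int reps = 100;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Conventional flop-count estimate for a complex FFT of N points: 5*N*log2(N).
    double flops = 5.0 * (double)N * log2((double)N);
    double gflops = flops * reps / (ms * 1e-3) / 1e9;
    printf("avg time per FFT: %.3f ms, ~%.1f Gflops\n", ms / reps, gflops);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```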
That the effective performance of cuFFT on the K40 falls so far short of the theoretical peak is also evident from the graphs published by NVIDIA at this link.
Can someone explain why this happens? Is there an intrinsic limit in the cuFFT library? Could it be memory- or cache-related?