I've compared a simple 3D cuFFT program on both a GTX 780 and a Tesla K40 in double precision mode.
On the GTX 780 I measured about 85 Gflops, while on the K40 I measured about 160 Gflops. These results baffled me: the GTX 780 has a theoretical double-precision peak of 166 Gflops, while the K40's is 1.4 Tflops.
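For reference, here is a minimal sketch of the kind of benchmark I ran (the 256^3 transform size and the standard 5*N*log2(N) flop-count estimate are assumptions; my actual program may differ in these details):

```cpp
#include <cstdio>
#include <cmath>
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int NX = 256, NY = 256, NZ = 256;       // assumed transform size
    const size_t N = (size_t)NX * NY * NZ;

    cufftDoubleComplex *data;
    cudaMalloc(&data, N * sizeof(cufftDoubleComplex));
    cudaMemset(data, 0, N * sizeof(cufftDoubleComplex));

    cufftHandle plan;
    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_Z2Z);    // double-precision complex-to-complex

    // Warm-up execution so plan setup and first-launch overhead are not timed.
    cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int reps = 100;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Conventional flop-count estimate for a complex FFT of N points: 5*N*log2(N).
    double flops = 5.0 * (double)N * log2((double)N);
    double gflops = flops * reps / (ms * 1e-3) / 1e9;
    printf("avg time per FFT: %.3f ms, ~%.1f Gflops\n", ms / reps, gflops);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```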
That the effective performance of cuFFT on the K40 falls so far short of the theoretical peak is also evident from the graphs published by NVIDIA at this link.
Can someone explain why this happens? Is there an intrinsic limit in the cuFFT library? Could it be memory- or cache-related?