CUDA fft - cooley tukey, how is parallelism exploited?

Question

I know how the FFT implementation works (Cooley-Tuckey algorithm) and I know that there's a CUFFT CUDA library to compute the 1D or 2D FFT quickly, but I'd like to know how CUDA parallelism is exploited in the process.

Is it related to the butterfly computation? (something like each thread loads part of the data into shared memory and then each thread computes an even term or an odd term?)

score 6 · Accepted Answer · answered Sep 10 '12 at 09:48

I do not think they use Cooley-Tuckey algorithm because its index permutation phase makes it not very convenient for shared-memory architectures. Additionally this algorithm works with power-of-two memory strides which is also not good for memory coalescing. Most likely they use some formulation of Stockham self-sorting FFT: for example Bailey's algorithm.

What concerns the implementation, you are right, usually one splits a large FFT into several smaller ones which fit perfectly within one thread block. In my work, I used 512- or 1024-point FFTs (completely unrolled of course) per thread block with 128 threads. Typically, you do not work with a classical radix-2 algorithm on the GPU due to large amount of data transfers required. Instead, one chooses radix-8 or even radix-16 algorithm so that each thread performs one large "butterfly" at a time. For example implementations, you can also visit Vasily Volkov page, or check this "classic" paper.

CUDA fft - cooley tukey, how is parallelism exploited?

1 Answers1