about CUFFT input sizes

Question

It's written that the cuFFT library supports algorithms that are highly optimized for input sizes that can be written in the following form: 2^a × 3^b × 5^c × 7^d.

How did they manage to do that?

As far as I know, an FFT only gives its best performance for input sizes of the form 2^a.

score 0 · Answer 1 · answered Apr 09 '15 at 23:54


This means that input sizes with prime factors larger than 7 would go slower.
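To check whether a given transform length falls into the optimized family from the question, one can strip out the factors 2, 3, 5, and 7 and see whether anything is left. The sketch below is plain Python, not part of cuFFT; the helper names `is_7_smooth` and `next_7_smooth` are made up for illustration, and padding up to the next such size is only one possible way to use the result.

```python
def is_7_smooth(n: int) -> bool:
    """Return True if n can be written as 2^a * 3^b * 5^c * 7^d."""
    if n < 1:
        return False
    # Divide out every factor of 2, 3, 5 and 7; a remainder other than 1
    # means n has a prime factor larger than 7.
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1


def next_7_smooth(n: int) -> int:
    """Return the smallest size >= n that has no prime factor larger than 7."""
    m = max(n, 1)
    while not is_7_smooth(m):
        m += 1
    return m
```

For example, `is_7_smooth(1000)` is true because 1000 = 2^3 × 5^3, while a prime length such as 1021 is not of this form; `next_7_smooth(1021)` gives 1024.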


llukas


Morc · Answer 2 · 2017-03-11T17:42:38.807

The Cooley-Tukey algorithm can operate on any composite DFT length that can be expressed as N = N_1*N_2. The algorithm recursively re-expresses a DFT of length N as N_1 smaller DFTs of length N_2, whose outputs are then combined using twiddle factors and N_2 DFTs of length N_1.

As you note, the fastest is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2 and runs in O(N log N). Factorizations into other small primes such as 3, 5, and 7 follow the same pattern and keep the O(N log N) complexity.

However, the actual performance depends on the hardware and the implementation. For example, since cuFFT runs on GPUs with a warp size of 32 threads, DFTs whose length is a multiple of 32 might be the most efficient (note: this is just an example; I'm not aware of the actual optimizations that exist under the hood of cuFFT).

Short answer: the underlying code is optimized for any size whose prime factors are all at most 7, based on the Cooley-Tukey radix-n (mixed-radix) algorithm.
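To make the N = N_1*N_2 decomposition concrete, here is a minimal pure-Python sketch of a recursive mixed-radix Cooley-Tukey transform. It is only an illustration of the idea, not how cuFFT is implemented; the function names are hypothetical, N_1 is assumed to be the smallest prime factor of N, and prime lengths simply fall back to the O(N^2) definition of the DFT.

```python
import cmath


def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def smallest_factor(n):
    """Smallest prime factor of n (returns n itself if n is prime or < 2)."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n


def mixed_radix_fft(x):
    """Recursive Cooley-Tukey DFT for any length, splitting N = N_1 * N_2."""
    n = len(x)
    n1 = smallest_factor(n)
    if n1 == n:
        # Prime (or trivial) length: no further split is possible here.
        return naive_dft(x)
    n2 = n // n1
    # N_1 sub-DFTs of length N_2 over the decimated inputs x[r], x[r+N_1], ...
    subs = [mixed_radix_fft(x[r::n1]) for r in range(n1)]
    out = [0j] * n
    for k in range(n):
        # Combine the sub-results with twiddle factors exp(-2*pi*i*r*k/N);
        # each sub-DFT is periodic in k with period N_2.
        out[k] = sum(
            cmath.exp(-2j * cmath.pi * r * k / n) * subs[r][k % n2]
            for r in range(n1)
        )
    return out
```

Each recursion level costs about N*N_1 operations, so when every prime factor is at most 7 the total stays proportional to N log N. A length with a large prime factor eventually hits the `naive_dft` fallback in this sketch, which is where the slowdown would come from.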

http://mathworld.wolfram.com/FastFourierTransform.html

https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
