
The documentation says the cuFFT library supports algorithms that are highly optimized for input sizes that can be written in the form 2^a × 3^b × 5^c × 7^d.

How did they manage to do that?

As far as I know, the FFT gives its best performance only for input sizes of 2^a.

Aleksandr Ianevski

2 Answers


This means that input sizes whose prime factorization contains a factor larger than 7 will run slower.
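In other words, the "fast" cuFFT sizes are the 7-smooth numbers. A minimal sketch of that check (the function name `is_cufft_smooth` is my own, not a cuFFT API):

```python
# Check whether an FFT size is "smooth" in the cuFFT sense, i.e. its
# prime factorization contains only the primes 2, 3, 5, and 7.
def is_cufft_smooth(n):
    if n < 1:
        return False
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    # If anything remains, n had a prime factor larger than 7.
    return n == 1

# 2016 = 2^5 * 3^2 * 7 is a fast size; 2017 is prime, so it is not.
print(is_cufft_smooth(2016))  # True
print(is_cufft_smooth(2017))  # False
```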

llukas

The Cooley-Tukey algorithm works on any composite DFT length that can be expressed as N = N_1*N_2. The algorithm recursively decomposes a DFT of length N into N_1 smaller DFTs of length N_2 (combined via twiddle factors and N_2 DFTs of length N_1).
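One such decomposition step can be sketched as follows (a plain-Python illustration, not cuFFT's implementation; the helper names are mine). The length-N DFT is computed as N_1 inner DFTs of length N_2 over strided subsequences, a twiddle-factor multiply, and N_2 outer DFTs of length N_1:

```python
import cmath

def dft(x):
    # Naive O(n^2) DFT, used here as the base case and as a reference.
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def cooley_tukey_step(x, n1, n2):
    # One Cooley-Tukey decomposition of a length-N DFT with N = n1 * n2.
    n = n1 * n2
    assert len(x) == n
    # Inner length-n2 DFTs over the n1 strided subsequences x[r], x[r+n1], ...
    inner = [dft(x[r::n1]) for r in range(n1)]
    out = [0j] * n
    for k2 in range(n2):
        # Twiddle factors, then a length-n1 DFT across the subsequences.
        col = [inner[r][k2] * cmath.exp(-2j * cmath.pi * r * k2 / n)
               for r in range(n1)]
        outer = dft(col)
        for k1 in range(n1):
            out[n2 * k1 + k2] = outer[k1]
    return out
```

Applying the same split recursively to the inner and outer DFTs (for any mix of factors 2, 3, 5, 7) is what gives a mixed-radix FFT.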

As you note, the fastest case is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(N log N).

However, the actual performance will depend on hardware and implementation. For example, if we consider cuFFT running on a GPU with a warp size of 32 threads, then DFT lengths that are a multiple of 32 might map to the hardware especially well (note: just an example, I'm not aware of the actual optimizations that exist under the hood of cuFFT).

Short answer: the underlying code is optimized for any input whose prime factors are at most 7, using mixed-radix Cooley-Tukey kernels.

http://mathworld.wolfram.com/FastFourierTransform.html

https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm

Morc