Kiss_fft is a pretty simple FFT implementation; it's almost textbook. It breaks larger DFTs down into repeated smaller FFTs by factoring the FFT size. It has optimized butterfly functions for radix 2, 3, 4 and 5.
Radices 2, 3 and 5 are small primes that turn up commonly when factoring, but the radix-4 butterfly is an optimized case that saves a few multiplications compared to applying radix-2 twice*. For instance, for a 32-point FFT, the size 32 is factored into 4x4x2: two stages of kf_bfly4 followed by one of kf_bfly2.
However, this means you can only have one kf_bfly2 stage. If your FFT size is a multiple of 4, you don't get two kf_bfly2 stages but a single kf_bfly4 stage. And that also means that the kf_bfly2 function works on "arrays" of length 1.
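To make the 4x4x2 example concrete, here is a minimal sketch of the factoring order, assuming the factorizer tries 4 first, then 2, then the odd numbers 3, 5, 7 and so on. It is modeled on kiss_fft's factoring routine but is not a copy of it; factor_stages and its two output arrays are hypothetical names used only for illustration.

```c
/* Hypothetical helper: factor n into butterfly stages, trying radix 4
   first, then 2, then 3, 5, 7, ...  For each stage it records the radix
   and m, the length of the sub-FFTs that remain after that stage.
   Returns the number of stages. The caller must supply arrays large
   enough for every stage (32 entries is ample for any int n). */
static int factor_stages(int n, int *radix, int *m)
{
    const int n0 = n;   /* original size, used for the sqrt bound */
    int p = 4;
    int count = 0;

    do {
        while (n % p) {
            switch (p) {
                case 4:  p = 2;  break;
                case 2:  p = 3;  break;
                default: p += 2; break;
            }
            /* Past sqrt(n0) no smaller factor is left: what remains is prime. */
            if ((long)p * p > n0)
                p = n;
        }
        n /= p;
        radix[count] = p;
        m[count] = n;
        ++count;
    } while (n > 1);

    return count;
}
```

For n = 32 this yields the (radix, m) pairs (4, 8), (4, 2), (2, 1), so in that example the radix-2 stage comes last and has m == 1. Running the same sketch on sizes that are not powers of two is one way to check whether that always holds.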
In code, the declaration is
static void kf_bfly2(kiss_fft_cpx * Fout, const size_t fstride, const kiss_fft_cfg st, int m)
where Fout is an "array" of size m, i.e. always 1. The butterfly loops over Fout, and the compiler of course can't do the numerical analysis to show that m==1. And since this is the last (innermost) butterfly, it is called quite often: 16 times in the 32-point example.
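As a sketch of what the m==1 case would buy, compare a generic radix-2 butterfly with a version specialized for a single pair. This is a simplified floating-point illustration, not the kiss_fft source: cpx, bfly2_generic and bfly2_m1 are hypothetical names, the twiddle table is passed directly instead of through the config struct, and the specialized version assumes that the first twiddle factor is exactly 1+0i.

```c
#include <stddef.h>

typedef struct { float r, i; } cpx;   /* hypothetical stand-in for kiss_fft_cpx */

/* Generic radix-2 butterfly: combines Fout[k] with Fout[k + m] for
   k = 0..m-1, using twiddle k * fstride for pair k. */
static void bfly2_generic(cpx *Fout, size_t fstride, const cpx *twiddles, int m)
{
    for (int k = 0; k < m; ++k) {
        const cpx tw = twiddles[(size_t)k * fstride];
        cpx *lo = &Fout[k];
        cpx *hi = &Fout[k + m];
        cpx t;

        /* t = hi * tw (complex multiply) */
        t.r = hi->r * tw.r - hi->i * tw.i;
        t.i = hi->r * tw.i + hi->i * tw.r;

        hi->r = lo->r - t.r;
        hi->i = lo->i - t.i;
        lo->r += t.r;
        lo->i += t.i;
    }
}

/* Specialized for m == 1: only twiddle 0 is touched, and it is assumed
   to be 1+0i, so the loop and the complex multiply both disappear. */
static void bfly2_m1(cpx *Fout)
{
    const cpx t = Fout[1];

    Fout[1].r = Fout[0].r - t.r;
    Fout[1].i = Fout[0].i - t.i;
    Fout[0].r += t.r;
    Fout[0].i += t.i;
}
```

With m == 1 the two functions compute the same sum and difference; the specialized one simply lets the compiler see that there is no loop and no multiply.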
Q1) Is my analysis correct? Is kf_bfly2 indeed called only with m==1?
Q2) Assuming that this is indeed a missed optimization, there are two possible approaches. We could just drop the m parameter from kf_bfly2, or we could change the factoring so that 32 becomes 2x4x4 (move kf_bfly2 up front and call it once at the top level for arrays of size 4x4=16). Which would be better?
[*] The internal factors of the radix-4 butterfly are +1, -1, +i and -i, so multiplying by them reduces to additions and subtractions (plus swapping real and imaginary parts). See "Why is the kiss_fft's forward and inverse radix-4 calculation different?"
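The footnote's point can be sketched with a bare 4-point forward DFT that uses no multiplications at all. This is only an illustration of the +1, -1, +i, -i structure, not the kf_bfly4 code (which also applies inter-stage twiddles); cpx and dft4_forward are hypothetical names.

```c
typedef struct { float r, i; } cpx;   /* hypothetical stand-in for kiss_fft_cpx */

/* 4-point forward DFT of in[0..3], written with additions and
   subtractions only. Multiplying x + iy by -i gives y - ix, and by +i
   gives -y + ix, so those "multiplications" are just a swap of the
   real and imaginary parts plus a sign change. */
static void dft4_forward(const cpx in[4], cpx out[4])
{
    const cpx s02 = { in[0].r + in[2].r, in[0].i + in[2].i };  /* in0 + in2 */
    const cpx d02 = { in[0].r - in[2].r, in[0].i - in[2].i };  /* in0 - in2 */
    const cpx s13 = { in[1].r + in[3].r, in[1].i + in[3].i };  /* in1 + in3 */
    const cpx d13 = { in[1].r - in[3].r, in[1].i - in[3].i };  /* in1 - in3 */

    out[0].r = s02.r + s13.r;  out[0].i = s02.i + s13.i;       /* s02 + s13   */
    out[2].r = s02.r - s13.r;  out[2].i = s02.i - s13.i;       /* s02 - s13   */
    out[1].r = d02.r + d13.i;  out[1].i = d02.i - d13.r;       /* d02 - i*d13 */
    out[3].r = d02.r - d13.i;  out[3].i = d02.i + d13.r;       /* d02 + i*d13 */
}
```

Swapping the signs on the d13 terms in out[1] and out[3] gives the inverse transform, which is where the forward and inverse radix-4 code paths differ.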