1

Kiss_fft is a pretty simple FFT implementation; it's almost textbook. It breaks down larger DFT's into repeated smaller FFT's, by factoring the FFT size. It has optimized butterfly functions for radix 2,3,4 and 5.

The 2,3 and 5 butterflies are small primes, commonly found when factoring, but the radix-4 butterfly is an optimized case that saves a few multiplications compared to using the radix-2 twice*. For instance, for a 32-point FFT, the size 32 is factored into 4x4x2 - two stages of kf_bfly4 followed by one of kf_bfly2

However, this means you can only have one kf_bfly2 stage. If your FFT size was a multiple of 4, you wouldn't have had two kf_bfly2 stages, but a single kf_bfly4 stage. And that also means that the kf_bfly2 function works on "arrays" of length 1.

In code, the declaration is

static void kf_bfly2(kiss_fft_cpx * Fout, const size_t fstride, const kiss_fft_cfg st, int m)

where Fout is an "array" of size m, i.e. always 1. The butterfly loops over Fout, and the compiler of course can't do the numerical analysis to show that m==1. But since this is the last butterfly, it is called quite often.

Q1) Is my analysis correct? Is kf_bfly2 indeed called only with m==1?

Q2) Assuming that this is indeed a missed optimization, there are two possible approaches. We could just drop the m parameter from kf_bfly2, or we could change the factor analysis to factor 32 as 2x4x4 (move kf_bfly2 up front, call it once at top level for arrays of size 4x4=16). What would be best?

[*] The radix-4 butterfly ends up having factors +1,-1, +i and -i, which can be implemented as additions and subtractions instead. See Why is the kiss_fft's forward and inverse radix-4 calculation different?

MSalters
  • 173,980
  • 10
  • 155
  • 350
  • The author, @Mark Borgerding, is active on StackOverflow: https://stackoverflow.com/users/3343/mark-borgerding – Paul R Oct 12 '21 at 13:39
  • @PaulR: I had noticed his comment on the linked question :) – MSalters Oct 12 '21 at 13:46
  • Sorry - I didn't notice that - I just thought it might be worth pinging him as I've seen him answer kiss-fft questions on here in the past. :) – Paul R Oct 12 '21 at 13:48
  • 1
    BTW, see also https://stackoverflow.com/questions/69542873/why-is-the-kiss-ffts-forward-and-inverse-radix-4-calculation-different-part-2. – MSalters Oct 12 '21 at 15:06

1 Answers1

2

Edit

The question has a one-off error. The function doesn't follow the usual convention; m==1 is the upper bound. The input is in fact an array of length 2, not a scalar. The remainder of the answer below is correct, and there's indeed an optimization possible

Optimized code (also for C++)

GitHub

Original answer

To understand the value of "m", it helps to understand how Kiss FFT works: it turns a large FFT recursively into smaller FFT's, by factoring the FFT size. And there are multiple possible factorizations: 32 is 4x4x2 but also 2x4x4.

Now that m parameter in kf_bfly2 and the other butterfly functions is the product of all the factors that follow. So, using 32 again as an example, kf_bfly4 will be called with m=8 and m=2.

I thought that m=1 in kf_bfly2 because in my tests it always happened to be the last factor, if it was a factor. If you use a 221 point FFT, the last factor will be 17, not 2.

But 34 is factored as 2x17, and in that case m==17. There is no special reason for this, this just happens to be an artifact of the implementation of kf_factor. It tests factorization by checking n%4, n%2, n%3, n%5, n%7, n%9, n%11 .... That's not the most effective way, of course. Checking n%9 is pointless as you already found two factors of 3 earlier.

But note from the position of n%4 there is freedom in checking the factors. We can simply check n%2 last. If we do this, then 34 will factor as 17x2 instead of 2x17. This may seem like a trivial optimization, but it does allow us to remove the loop in kf_bfly2. In addition, it now only uses the first twiddle factor - and that is a very convenient {1 + 0i}. So the only multiplication in kf_bfly2 is also eliminated.

The optimizations don't stop here - since kf_bfly2 is now a whole lot smaller, and there is only one call site*, your compiler will likely inline kf_bfly2 into kf_work. And we know from the changed factoring that 2 can only be the last factor, so we can stop the recursion in kf_work as well.

TL;DR

When you change the order of factorization, you unlock multiple optimizations in a row.

  • * Footnote: There's one special case, the OpenMP optimization will use multiple threads for the top-level, if that is small enough. Turning a 2x17 FFT into a 17x2 FFT will mean the second thread is no longer used, but that was a bit pointless anyway for such a small FFT.
MSalters
  • 173,980
  • 10
  • 155
  • 350
  • Considering your kwowledge of kissFFT guts, I would like to kindly ask for your feedback on this currently opened PR: https://github.com/mborgerding/kissfft/pull/84 – Julien M Sep 23 '22 at 19:07
  • @JulienM: I've personally never looked into the fixed-point side, sorry – MSalters Sep 26 '22 at 07:50