Using FMA instructions for an FFT algorithm

Question

I have a bit of C++ code that has become a somewhat useful FFT library over time, and it has been made to run decently fast using SSE and AVX instructions. Granted, it's all only based on a radix-2 algorithm, but it still holds up. My latest itch to scratch is making the butterfly calculations work with FMA instructions. The basic radix-2 butterfly consists of 4 multiplies, and 6 additions or subtractions. A simple approach would involve replacing 2 of the additions and subtractions and 2 multiplies with 2 FMA instructions, resulting in a mathematically identical butterfly, but there are apparently better ways of doing this:

https://books.google.com/books?id=2HG0DwAAQBAJ&pg=PA56&lpg=PA56&dq=radix+2+fft+fma&source=bl&ots=R5XDWyYBVv&sig=ACfU3U0S2n1hcgiP63LTKMxI5Oc85eEZaQ&hl=en&sa=X&ved=2ahUKEwiz_I3PsrToAhVoHzQIHYmVDGIQ6AEwDXoECAoQAQ#v=onepage&q=radix%202%20fft%20fma&f=false

ci1 = ci1 / cr1
u0 = zinr(0)
v0 = zini(0)
r = zinr(1)
s = sini(1)
u1 = r - s * ci1
v1 = r * ci1 + s
zoutr(0) = u0 + u1 * cr1
zouti(0) = v0 + v1 * cr1
zoutr(1) = u0 - u1 * cr1
zouti(1) = v0 - v1 * cr1

The author replaces all 10 adds, subs, and mults with 6 FMA's, provided that the imaginary part of the twiddle factor is divided by the real part. Part of the text reads "Note that cr1 != 0". Which is essentially my problem in a nutshell. The math seems to work just as advertised for all twiddle factors except when the real twiddle is zero, in which case, we end up dividing by zero. Where efficiency is absolutely critical here, branching code when cr1 == 0 to a different butterfly isn't a good option, especially when we're using SIMD to process multiple twiddles and butterflies at once, where perhaps only one element of cr1 == 0. What my gut is telling me should be the case, is that when cr1 == 0, cr1 and ci1 should be some other values entirely and the FMA code will still result in the correct answer, but I cannot seem to figure this out. If I could figure it out, it would be a relatively straightforward thing to modify the precomputed twiddle factors for FMA butterflies and we also could, of course, avoid the division operation at the start of the butterfly.

I cannot read anything through the link you provided. Partially related: if you are interested by efficiency, did you test using radix-4 ? — Damien, Apr 01 '20 at 15:07

score 1 · Answer 1 · answered Mar 28 '20 at 19:09

The book seems to suggest that cr1 != 0 is always true. But unfortunately, it is not always the case (when the rotation angle is PI/2).

I don't think that you can solve this by adjusting the twiddle factors. The only option I see is to use some very small number instead of zero. It could work, but it's ugly, and it may cause inaccuracies in certain cases.

Possible solutions:

Split the loop into two, and handle this center case (where division by zero happens) specially
Instead of dividing by cr1, divide by ci1, and modify the forumula accordingly. This case still has a divison by zero, but it will happen at the first iteration of the loop. So instead of the center, you have to handle the first iteration specially (so only one loop is needed).
Use a different FMA formulation:

Notice, that:

zoutr(1) = u0 - u1 
         = u0 - u1 - (u0 + u1) + (u0 + u1) 
         = u0 - u1 - zoutr(0) + u0 + u1 
         = 2*u0 - zoutr(0)

So, this operation can be done in 1 FMA.

And if you substitute u1 into the expression of zoutr(0):

zoutr(0) = u0 + u1
         = u0 + r*cr1 - s*ci1

This can be done with 2 FMAs.

Calculating zouti can be done in the same manner as zoutr. So this way you need to use 6 FMA operations, which is the same amount of operations that the book has.

(Note, this doesn't mean that this variant will run faster automatically, as it has a different data dependency chain)

Using FMA instructions for an FFT algorithm

1 Answers1