I have a bit of C++ code that has become a somewhat useful FFT library over time, and it has been made to run decently fast using SSE and AVX instructions. Granted, it's all only based on a radix-2 algorithm, but it still holds up. My latest itch to scratch is making the butterfly calculations work with FMA instructions. The basic radix-2 butterfly consists of 4 multiplies, and 6 additions or subtractions. A simple approach would involve replacing 2 of the additions and subtractions and 2 multiplies with 2 FMA instructions, resulting in a mathematically identical butterfly, but there are apparently better ways of doing this:
ci1 = ci1 / cr1
u0 = zinr(0)
v0 = zini(0)
r = zinr(1)
s = sini(1)
u1 = r - s * ci1
v1 = r * ci1 + s
zoutr(0) = u0 + u1 * cr1
zouti(0) = v0 + v1 * cr1
zoutr(1) = u0 - u1 * cr1
zouti(1) = v0 - v1 * cr1
The author replaces all 10 adds, subs, and mults with 6 FMA's, provided that the imaginary part of the twiddle factor is divided by the real part. Part of the text reads "Note that cr1 != 0". Which is essentially my problem in a nutshell. The math seems to work just as advertised for all twiddle factors except when the real twiddle is zero, in which case, we end up dividing by zero. Where efficiency is absolutely critical here, branching code when cr1 == 0 to a different butterfly isn't a good option, especially when we're using SIMD to process multiple twiddles and butterflies at once, where perhaps only one element of cr1 == 0. What my gut is telling me should be the case, is that when cr1 == 0, cr1 and ci1 should be some other values entirely and the FMA code will still result in the correct answer, but I cannot seem to figure this out. If I could figure it out, it would be a relatively straightforward thing to modify the precomputed twiddle factors for FMA butterflies and we also could, of course, avoid the division operation at the start of the butterfly.