
I'm a beginner in programming and am currently working on a project that requires a Fast Fourier Transform implementation.

So far I have managed to implement the following:

Does anyone have any alternatives or suggestions to improve the speed of the program without losing accuracy?

short FFTMethod::FFTcalc(short int dir, long m, double *x, double *y)
{
   long n, i, i1, j, k, i2, l, l1, l2;
   double c1, c2, tx, ty, t1, t2, u1, u2, z;

   /* Calculate the number of points: n = 2^m */
   n = 1;
   for (i = 0; i < m; i++)
      n *= 2;

   /* Do the bit reversal */
   i2 = n >> 1;
   j = 0;
   for (i = 0; i < n - 1; i++) {
      if (i < j) {
         tx = x[i];
         ty = y[i];
         x[i] = x[j];
         y[i] = y[j];
         x[j] = tx;
         y[j] = ty;
      }
      k = i2;
      while (k <= j) {
         j -= k;
         k >>= 1;
      }
      j += k;
   }

   /* Compute the FFT */
   c1 = -1.0;
   c2 = 0.0;
   l2 = 1;
   for (l = 0; l < m; l++) {
      l1 = l2;
      l2 <<= 1;
      u1 = 1.0;
      u2 = 0.0;
      for (j = 0; j < l1; j++) {
         for (i = j; i < n; i += l2) {
            i1 = i + l1;
            t1 = u1 * x[i1] - u2 * y[i1];
            t2 = u1 * y[i1] + u2 * x[i1];
            x[i1] = x[i] - t1;
            y[i1] = y[i] - t2;
            x[i] += t1;
            y[i] += t2;
         }
         z  = u1 * c1 - u2 * c2;
         u2 = u1 * c2 + u2 * c1;
         u1 = z;
      }
      c2 = sqrt((1.0 - c1) / 2.0);
      if (dir == 1)
         c2 = -c2;
      c1 = sqrt((1.0 + c1) / 2.0);  /* runs every pass, not inside the if */
   }

   /* Scaling for forward transform */
   if (dir == 1) {
      for (i = 0; i < n; i++) {
         x[i] /= n;
         y[i] /= n;
      }
   }

   return 1;
}
sagarn
    Unless you need to write it yourself for understanding purposes, FFTW (http://www.fftw.org/) is a great library. It's a self-adjusting, super-fast, and reliable implementation and you can call it from C++ just fine (see http://www.fftw.org/faq/section2.html#cplusplus) – Jeff Foster Dec 21 '11 at 09:29
  • I liked FFTReal very much. http://ldesoras.free.fr/prod.html – Draco Ater Dec 21 '11 at 09:36
  • Why do you write your own implementation instead of using one of the countless libraries out there, which are likely all faster, better tested, more accurate and have more features? – PlasmaHH Dec 21 '11 at 10:50
  • Also look at KissFFT: http://sourceforge.net/projects/kissfft/ - it's completely free, whereas FFTW has nasty licensing issues for anything other than GPL projects – Paul R Jan 19 '12 at 09:32
  • Do you care about non-2^n size inputs? – J D Apr 12 '12 at 13:30

4 Answers


I recently found this excellent PDF on the construction of high-performance FFTs by Eric Postpischil. Having developed several FFTs myself, I know how hard it is to compete with commercial libraries. Believe me, you're doing well if your FFT is only 4x slower than Intel or FFTW, not 40x! You can compete, however, and here's how.

To summarise that article: the author states that radix-2 FFTs are simple but inefficient, and that the most efficient construct is the radix-4 FFT. Radix-8 is more efficient still, but it often does not fit into the registers of a CPU, so radix-4 is generally preferred.

FFTs can be constructed in stages, so to compute a 1024-point FFT you could perform 10 stages of the radix-2 FFT (as 2^10 = 1024), or 5 stages of the radix-4 FFT (4^5 = 1024). You could even compute a 1024-point FFT in stages of 8*4*4*4*2 if you so choose. Fewer stages mean fewer reads and writes to memory (the bottleneck for FFT performance is memory bandwidth), hence dynamically choosing radix 4, 8 or higher is a must. The radix-4 stage is particularly efficient, as all the weights are 1+0i, 0+1i, -1+0i, 0-1i, and the radix-4 butterfly code can be written to fit entirely in the cache.
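
As a sketch of the staging idea (this helper is hypothetical, not from the article): a power-of-two length can be greedily factored into radix-8/4/2 passes, preferring the larger radices so the data makes fewer trips through memory.

```cpp
#include <vector>

/* Greedily factor a power-of-two FFT length into radix-8/4/2 stages,
   preferring larger radices to minimise the number of passes over
   memory. 1024 factors as 8*8*8*2: four stages instead of ten. */
std::vector<int> chooseRadices(long n)
{
    std::vector<int> radices;
    while (n > 1) {
        if (n % 8 == 0)      { radices.push_back(8); n /= 8; }
        else if (n % 4 == 0) { radices.push_back(4); n /= 4; }
        else                 { radices.push_back(2); n /= 2; }
    }
    return radices;
}
```

Each returned radix then selects the matching butterfly routine for that stage.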

Secondly, the stages of the FFT are not all the same. In the first stage the weights are all equal to 1+0i; there is no point computing this weight, or even multiplying by it, as it is a complex multiply by 1, so the first stage may be performed without weights. The final stage may also be treated differently and can be used to perform the decimation in time (bit reversal). Eric Postpischil's document covers all this.
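
For illustration, that weightless first stage reduces to a plain add/subtract pass over adjacent pairs (a sketch matching the layout of the code in the question, with separate real/imaginary arrays; the function name is made up):

```cpp
/* First radix-2 stage of a decimation-in-time FFT over separate
   real (x) and imaginary (y) arrays. The twiddle factor is 1+0i,
   so each butterfly is a plain add/subtract on adjacent pairs --
   no complex multiply at all. */
void firstStageNoWeights(long n, double *x, double *y)
{
    for (long i = 0; i < n; i += 2) {
        double tx = x[i + 1];
        double ty = y[i + 1];
        x[i + 1] = x[i] - tx;
        y[i + 1] = y[i] - ty;
        x[i] += tx;
        y[i] += ty;
    }
}
```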

The weights may be precomputed and stored in a table. Sin/cos calculations take around 100-150 cycles each on x86 hardware, so precomputing these can save 10-20% of the overall compute time, as memory access is in this case faster than CPU calculations. Using fast algorithms to compute sin and cos in one go is particularly beneficial (note that the magnitude of cos equals sqrt(1.0 - sine*sine), and with table lookups, cos is just a phase shift of sine).
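
A minimal sketch of such a table (the type name is illustrative): compute the n/2 forward twiddle factors W_n^k = exp(-2*pi*i*k/n) once, then index into the table inside the butterfly loops instead of calling sin/cos there.

```cpp
#include <cmath>
#include <vector>

/* Precompute the n/2 forward twiddle factors W_n^k = exp(-2*pi*i*k/n)
   once, so the butterfly loops do table lookups instead of calling
   sin/cos, or deriving the weights by repeated complex multiplication
   (which also accumulates rounding error). */
struct TwiddleTable {
    std::vector<double> re, im;
    explicit TwiddleTable(long n) : re(n / 2), im(n / 2) {
        const double PI = 3.14159265358979323846;
        for (long k = 0; k < n / 2; ++k) {
            double a = -2.0 * PI * (double)k / (double)n;
            re[k] = cos(a);
            im[k] = sin(a);
        }
    }
};
```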

Finally, once you have your super-streamlined FFT implementation, you can utilise SIMD vectorization to compute 4x single-precision or 2x double-precision floating point operations per cycle inside the butterfly routine, for another 100-300% speed improvement. Taking all of the above together, you'd have yourself a pretty slick and fast FFT!
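
As an aside on data layout (a sketch, not taken from the article): keeping real and imaginary parts in separate contiguous arrays, as the question's code already does, gives the compiler a straightforward countable loop to auto-vectorize with SSE/AVX; hand-written intrinsics can squeeze out more, but this layout is the prerequisite either way.

```cpp
/* Structure-of-arrays complex multiply: with real/imaginary parts in
   separate contiguous arrays and a simple countable loop, compilers
   at -O2/-O3 can emit SSE/AVX instructions for it automatically. */
void complexMulArrays(long n, const double *ar, const double *ai,
                      const double *br, const double *bi,
                      double *cr, double *ci)
{
    for (long i = 0; i < n; ++i) {
        cr[i] = ar[i] * br[i] - ai[i] * bi[i];
        ci[i] = ar[i] * bi[i] + ai[i] * br[i];
    }
}
```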

To go further, you can optimise on the fly by providing different implementations of the FFT stages targeted at specific processor architectures. Cache size, register count, SSE/SSE2/3/4 instruction sets, etc. differ per machine, so a one-size-fits-all approach is often beaten by targeted routines. In FFTW, for instance, many smaller-size FFTs are highly optimised, unrolled (loop-free) implementations targeted at a specific architecture. By combining these smaller constructs (such as radix-N routines) you can choose the fastest and best routine for the task at hand.

Dr. Andrew Burnett-Thompson
  • Thanks a lot. You have been very helpful. I shall try making the changes. – sagarn Dec 23 '11 at 09:16
  • Performance tuning is a bit of a black art. I would suggest creating a test application which runs multiple iterations of different FFT methods and times them, plus compares the result accuracy and speed of the transform to a known FFT implementation (for instance, FFTW). Rather than completely change an implementation, keep it but create new implementations and compare those. You will be surprised what does and does not increase performance. E.g. reducing the number of multiplies may not have as big an effect as ensuring you perform your RAM reads sequentially and as few times as possible! – Dr. Andrew Burnett-Thompson Dec 23 '11 at 09:27
  • If the comment has been helpful to you, please vote up. Thanks! :-) – Dr. Andrew Burnett-Thompson Dec 31 '11 at 10:44
  • Hello Jon: yes, it is; however, any-N inputs can be formulated from 2^N FFTs using the Bluestein method (also discussed in that white paper). 2^N sizes are also much easier to optimize and keep in cache, hence they make a great building block to optimize and re-use elsewhere – Dr. Andrew Burnett-Thompson Apr 12 '12 at 14:40

While I can't give you a performance hint right now, I'd like to offer some optimization advice that is too long for a comment:

  1. If you haven't done so, write a number of correctness tests for your code right away. Simple tests like "do an FFT of this array and see if the results match the ones I've provided" suffice, but before you optimize code, you need a firm and automated unit test that confirms your optimized code is correct.
  2. Then profile your code to see where the actual bottleneck is. While I suspect the innermost loop (for (i = j; i < n; i += l2)), seeing is better than believing.
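
For example, a direct O(n^2) DFT makes a convenient reference for such a correctness test (a sketch; the function name, tolerance, and test vectors are up to you):

```cpp
#include <cmath>

/* Direct O(n^2) DFT, used only as a reference implementation: slow
   but easy to verify by eye. An optimised FFT should match its output
   on the same input to within floating-point tolerance. */
void naiveDFT(long n, const double *x, const double *y,
              double *outRe, double *outIm)
{
    const double PI = 3.14159265358979323846;
    for (long k = 0; k < n; ++k) {
        double sr = 0.0, si = 0.0;
        for (long t = 0; t < n; ++t) {
            double a = -2.0 * PI * (double)k * (double)t / (double)n;
            sr += x[t] * cos(a) - y[t] * sin(a);
            si += x[t] * sin(a) + y[t] * cos(a);
        }
        outRe[k] = sr;
        outIm[k] = si;
    }
}
```

Note that the code in the question divides by n when dir == 1, so scale accordingly before comparing.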
thiton

There are several things I can recommend trying:

  1. Don't swap the input elements, instead calculate the bit-reversed index. This will save you a number of memory reads and writes.
  2. Precalculate the coefficients if you're doing many FFTs of the same size. This will save some computations.
  3. Use radix-4 FFT instead of radix-2. This will result in fewer iterations in the inner loops.

The ultimate answer can, of course, be found by profiling the code.
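
A sketch of point 1 (the helper name is made up): compute each m-bit reversed index directly and read from that position when producing the output, instead of running the swap loop first.

```cpp
/* Return the m-bit reversal of index i. With this, the bit-reversed
   permutation can be applied while reading,
   out[i] = in[bitReverse(i, m)],
   instead of swapping elements in place beforehand. */
long bitReverse(long i, long m)
{
    long r = 0;
    for (long b = 0; b < m; ++b) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}
```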

Alexey Frunze
  • If I understand you correctly, (1) is a bad idea. You're saving some memory operations but you're also randomizing far more of them, which is much worse because it destroys the advantages of the CPU caches in the main loop. – J D Apr 12 '12 at 13:24
  • @JonHarrop: doesn't swapping incur "randomizing" too? You will inevitably access the same data anyway *and* out of order either at swap time or later if there's no swapping. – Alexey Frunze Apr 13 '12 at 05:38
  • @Alex Yes. You have a choice between random access during the sort (which is the quick initial phase) or the butterflies (which is the slow final phase). For the best performance, you choose to do a little random access up-front so the main computation becomes nice sequential access. – J D Apr 13 '12 at 08:10

This looks like a basic radix-2 FFT implementation straight out of an old textbook. There are many dozens of decades-old papers on optimizing FFTs in various ways, depending on many factors. For instance, is your data smaller than the CPU cache?

Added: For instance, if the data vector plus a table of coefficients will fit into CPU dcache and/or if multiplies are much slower than memory accesses on your CPU, then precomputing a table of twiddle factors may reduce the total cycle count for repeated use of the FFT. But if not, precomputing might actually be slower. Benchmark. YMMV.
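
A minimal timing harness for such benchmarks might look like this (a sketch using std::chrono; wrap each candidate FFT call in a lambda and compare the averages):

```cpp
#include <chrono>

/* Run a callable `iters` times and return the average wall-clock
   seconds per call, measured with a monotonic clock. */
template <typename F>
double secondsPerCall(F f, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iters;
}
```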

hotpaw2
  • Yes you are right @hotpaw2, I referred to a book called Numerical Recipes in C as I found it the best place to start. This is however just the first attempt at it and I have a lot of optimization to do before I complete the project. Yes the data is smaller than CPU cache. – sagarn Dec 21 '11 at 10:31