
I am working on an application which requires me to Fourier-transform batches of 2-dimensional signals, stored as single-precision complex floats.

I wanted to test the idea of dissecting those signals into smaller ones to see whether I can improve the efficiency of my computation, given that the FLOP count of an FFT grows as O(N log N). Of course, different signal sizes (in memory) may achieve different FLOPS/s, so to really see if this idea can work I ran some experiments.

What I observed in the experiments was that performance varied very abruptly with signal size, jumping for example from 60 Gflops/s to 300 Gflops/s! I am wondering why that is the case.

I ran the experiments using:

Compiler: g++ 9.3.0 ( -Ofast )
Intel MKL 2020 (static linking)
MKL-threading: GNU

OpenMP environment:

export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=20

Platform:

Intel Xeon Gold 6248

https://ark.intel.com/content/www/us/en/ark/products/192446/intel-xeon-gold-6248-processor-27-5m-cache-2-50-ghz.html

Profiling tool:

Score-P 6.0

Performance results:

To estimate the average FLOP rates I assume: # of FLOPs = Nbatch * 5 * N*N * log2(N*N)

When using batches of 2D signals of size 201 x 201 elements (N = 201), the observed average performance was approximately 72 Gflops/s.

Then, I examined the performance using 2D signals with N = 101, 102, 103, 104 or 105. The performance results are shown in the figure below.

(figure 1: average Gflops/s for N = 101–105)

I also ran experiments with smaller sizes, such as N = 51, 52, 53, 54 or 55. The results are again shown below.

(figure 2: average Gflops/s for N = 51–55)

And finally, for N = 26, 27, 28, 29 or 30.

(figure 3: average Gflops/s for N = 26–30)

I performed the experiments twice and the performance results are consistent! I really doubt it is noise... but then again, it feels quite unrealistic to achieve 350 Gflops/s. Or maybe not?

Has anyone experienced similar performance variations, or have some comments on this?

  • On average, the peak performance drops when going to smaller and smaller sizes: for example it is `369 Gflops/s` in figure 1 and less than `200` in figure 3, with figure 2 in the middle. I think this is reasonable, considering that arithmetic intensity is higher for larger signals! – Andreas Hadjigeorgiou Nov 04 '21 at 09:48
  • 3
    The higher performances seem to correspond to sizes that decompose (factorise) well, e.g. 101 and 103 are prime, 102 has a factor of 17 in it. FFTs generally work better when they decompose into small primes. As an experiment, try zero-padding to a larger size which factorises nicely (2^n ideally). – Paul R Nov 04 '21 at 11:05
  • 1
    @PaulR Yes, you have a good point! And actually in the last figure, the lowest performance is obtained when N=29, which is the only prime number in the range 26-30. That's a good one, thanks!! – Andreas Hadjigeorgiou Nov 04 '21 at 12:05
  • @AndreasHadjigeorgiou Also, do you warm up your CPU right before measuring performances ? – Soleil Nov 04 '21 at 15:04
  • @Soleil Yes, actually other computations are involved in the application, so the CPU is warmed up pretty well ;) But as I said, I repeated the experiments to make sure my performance results are not determined by noise. – Andreas Hadjigeorgiou Nov 04 '21 at 15:30
  • 1
    @AndreasHadjigeorgiou I'm sure the repetition of your experiment gives consistent results. The warmup is merely about the Gflops rather than the aperture length. Also, the CPU frequencies may change quickly, so just make sure it's not dropping. Usually I add a warmup loop in the code right before the critical part, and I have dedicated tests for this; which is better than "other computations from the application", which may be inconsistent in delay, intensity, etc. Intel advisor may help you to increase the Gflops. – Soleil Nov 04 '21 at 15:40
  • Can you try with the latest MKL which is oneMKL 2021.4.0? Maybe you can also refer to the article https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-ipp-choosing-an-fft.html which helps you to determine which FFT, Intel® MKL, or Intel® IPP is best suited for your application. – Vidyalatha_Intel Nov 16 '21 at 06:51
  • @Vidyalatha_Intel Thanks! So Intel supports two different FFT libraries, I didn't know that! What is the purpose of this? Could you please give some comments? – Andreas Hadjigeorgiou Nov 16 '21 at 07:57

1 Answer


You can use the FFT from either the Intel MKL or the Intel IPP (Intel® Integrated Performance Primitives) library. As mentioned earlier in the comments section, the article linked below helps to determine which library is best suited for your application.

If you are working on engineering, scientific, or financial applications, you can go with the Intel MKL library; if you are working with imaging, vision, signal, security, or storage applications, the Intel IPP library helps with speed. Intel® MKL is suited to the large problem sizes typical of FORTRAN and C/C++ high-performance computing applications such as those mentioned above. Intel® IPP is specifically designed for smaller problem sizes, including those used in multimedia, data processing, communications, and embedded C/C++ applications.

For complete details, please refer to:

https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-ipp-choosing-an-fft.html
https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top.html
https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top.html