What is the fastest FFT library for iOS/Android ARM devices?

Question

What is the fastest FFT library for iOS/Android ARM devices? And what library do people typically use on iOS/Android platforms? I'm guessing vDSP is the library most frequently used on iOS.

EDIT: my code is at http://anthonix.com/ffts and uses the BSD license. It runs on Android and iOS, and it is faster than libav, FFTW and vDSP.

EDIT2: if anyone can provide access to a POWER7 machine (or other machines) please email me. It would be much appreciated.

Cheers,

Welcome to Stackoverflow! If you find a response is helpful, please up vote it. If the response successfully answers your question, please click the green check mark next to it to accept the answer. Also please look at http://stackoverflow.com/questions/how-to-ask for advice on how to write a good question — Kurtis Nusbaum, Nov 03 '11 at 22:37
I'm confused -- why are you benchmarking performance for interleaved formats? vDSP operates on split complex data, because it is the preferred layout for many other signal processing operations on complex data. Is the cost of mapping between these layouts accounted for in your benchmark? — Stephen Canon, Nov 04 '11 at 01:27
Stephen: yes the cost is accounted for; I'm performing the FFT as per 'Usage Case 2: Fast Fourier Transforms' in the Apple developer library article 'Using the Accelerate Framework for Data Processing' (http://developer.apple.com/library/mac/#featuredarticles/AccelerateFrameworkData/_index.html). I'm fairly new to signal processing.. why is split format the preferred layout? What other libraries use it? I've only used a few other libraries, such as FFTW, and vDSP has been the only library that uses split format. — Anthony Blake, Nov 04 '11 at 01:37
Suppose you want to multiply the signal by a complex value (or perform any other operation beyond addition, really); if you use an interleaved format, a large number of permutes may be required to carry it out. With a split format, those permutes are avoided. — Stephen Canon, Nov 04 '11 at 01:49
Stephen: not on ARM NEON -- vld2 and vst2 enable the permutes to be done for free as the data is loaded/stored. — Anthony Blake, Nov 04 '11 at 01:58
You might want to split this up into two questions, one for iOS and one for Android. The two platforms are different enough (language, etc.) that there will probably be separate libraries for both. Also, I'm very surprised that you claim better performance than the Accelerate framework on iOS, because that's tuned by some fairly knowledgeable engineers at Apple for their specific hardware. They claim a 5X improvement of their stuff over FFTW on ARM. — Brad Larson, Nov 04 '11 at 01:58
Brad: it doesn't surprise me that Accelerate was 5X faster than FFTW on ARM; e.g., FFTW 3.3.1 uses vmul for the conjugate function on ARM -- as I mentioned, FFTW on ARM can't be considered a serious FFT library. And I'll split the question up, thanks. — Anthony Blake, Nov 04 '11 at 03:02
Note that `vld2` and `vst2` don't actually perform the permute "for free" on every ARM processor; there is frequently a performance penalty associated with using them instead of `vldmia/vstmia` or `vld1/vst1`. Note also that even if they did perform it for free, that doesn't help on other architectures (and would not be guaranteed to be "free" on all future ARM architectures either). Apple is providing a stable API that can deliver good performance across current and future architectures without requiring developers to change their code. — Stephen Canon, Nov 04 '11 at 09:37
@Stephen I've been able to use the interleaving/de-interleaving memory operations to compute FFTs quite successfully, as the graph above shows. On the other SIMD machines I've run my code on that don't have memory interleaving/de-interleaving operations -- namely those implementing AVX and SSE -- my code was *much* faster than vDSP. Their NEON code is much better than their SSE code, but it's by no means the best. — Anthony Blake, Nov 04 '11 at 10:24
@AnthonyBlake: I don't mean to suggest that it's "the best" (indeed, I don't believe I ever said that). What I'm asserting is that a split-complex layout is more conducive to generally "good" performance for a variety of signal processing computations (not necessarily FFTs) on diverse architectures. — Stephen Canon, Nov 04 '11 at 11:49
@StephenCanon Well I was asking if there was anything better than vDSP that I should know about -- I agree that vDSP has pretty good performance on ARM NEON. And I agree that split format makes the computation easier, and I anticipate my code would run even faster with split format, which I'll try when I get some time. — Anthony Blake, Nov 04 '11 at 12:06
The Accelerate framework wasn't available prior to iOS 4.0. So the most commonly used FFT on iOS may well be the one used in the aurioTouch sample app on Apple's developer site, which is quite slow compared to the one in vDSP. — hotpaw2, Nov 04 '11 at 16:24
@hotpaw2: interesting.. that's the sort of info I was hoping to learn, thanks — Anthony Blake, Nov 04 '11 at 22:30
@AnthonyBlake : How do you calculate the megaflops? If you can FFT 65536 samples 100 times in 1 second, 65536 * 100 = 6.55 megaflop? Is it so? — Jake 'Alquimista' LEE, Nov 15 '11 at 18:52
@Jake mflops = 5 N log2(N) / (time for one FFT in microseconds) — Anthony Blake, Nov 15 '11 at 22:12
@Jake Take a look at benchfft for info on benchmarking FFTs (http://www.fftw.org/speed/method.html) — Anthony Blake, Nov 15 '11 at 22:12
ne10 https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/ne10-fft-feature-radix-3-and-radix-5-fft-are-supported-neon-optimization-significant-performance-improvement-by-neon-optimization — Alok Prasad, Sep 15 '19 at 04:29
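
Editor's note on the MFLOPS question in the comments above: the formula quoted there, mflops = 5 N log2(N) / (time for one FFT in microseconds), can be checked with a few lines of Python. This is only a sketch of that arithmetic. The function name `fft_mflops` is made up for illustration and is not part of FFTS, benchFFT, or any other library, and the figure it returns is a nominal, scaled speed measure rather than a count of the floating-point operations a given implementation actually performs.

```python
import math


def fft_mflops(n, seconds_per_fft):
    """Nominal MFLOPS for one complex FFT of size n.

    Uses the convention quoted in the thread:
        mflops = 5 * N * log2(N) / (time for one FFT in microseconds)
    `fft_mflops` is a hypothetical helper name, not a library function.
    """
    microseconds = seconds_per_fft * 1e6
    return 5.0 * n * math.log2(n) / microseconds


# The case asked about in the comments: a 65536-point FFT
# executed 100 times in one second, i.e. 0.01 s per transform.
n = 65536
seconds_per_fft = 1.0 / 100
print(f"{fft_mflops(n, seconds_per_fft):.3f} MFLOPS")
```

Under this convention the result for that case is roughly 524 MFLOPS, not 6.55: the numerator is 5 * 65536 * 16 nominal operations per transform, divided by 10,000 microseconds per transform.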

score 11 · Answer 1 · answered Nov 25 '11 at 13:19

Here is a page benchmarking different FFT implementations on ARM:

http://pmeerw.dyndns.org/blog/programming/neon3.html

According to that page, the fastest FFT implementation is LibAv, which has a NEON-optimized FFT: http://libav.org/

Martin Mogensen

Interesting.. do you know if the libav FFT can be compiled by itself? And what range of sizes and types of transform it computes? – Anthony Blake Nov 25 '11 at 13:39
Disregard the last comment; I had a look at the source code. I'll try and benchmark it versus my code next week. – Anthony Blake Nov 25 '11 at 13:46
Anthony, any results of sfft vs libav? I wasn't able to find anything faster than libav; the only part that I'd like to improve is that they don't have an armv6-optimized version (NEON only). Surprisingly, even the C-only version performs very well, but probably it could be doubled in speed with proper asm. – Pavel P Apr 07 '12 at 05:32
@Pavel I just benchmarked FFTS against libav, and FFTS was faster, in most cases by a factor of at least 2 – Anthony Blake Nov 18 '12 at 23:00
@AnthonyBlake that's surprising. I'm reading my own question here more than a year later by the way :) For my own need I wasn't able to see anything close to libav/ffmpeg's fft implementation. I use it for 16bit ints in sound processing. Are your tests using floats/doubles or ints? I'm mostly interested in arm-neon optimized version, and it's extremely well optimized. – Pavel P Dec 03 '13 at 05:53

score 4 · Answer 2 · answered Nov 18 '12 at 12:55

I've compared many NEON optimized FFT libraries on ARM Cortex-A9, and "libav" is certainly the fastest FFT code, but it is:

- single-threaded,
- only supports 1D FFTs,
- only supports power-of-2 dimensions,
- and doesn't have various optimizations for real input/output (it is only a complex-to-complex FFT).

On the other hand, "FFTW" (either the official version or the Vesperix version) is multi-threaded, supports 2D FFTs, supports non-power-of-2 dimensions with very little penalty, and has full optimizations for real input/output instead of just complex input/output.

So depending on your FFT requirements, FFTW might be faster for your project due to the extra features, but if you only need the FFT that libav provides (or you write the extra features yourself using NEON and multi-threading), then libav is actually the fastest 1D Complex-to-Complex FFT code.

To give you an indication, it seems that the FFTW NEON optimizations were performed by a student of the guy who performed the libav NEON optimizations. So would you rather the code from the student or the mentor ;-)

Another issue is licensing: libav uses the LGPL, whereas FFTW uses the GPL, which is more restrictive unless you are willing to pay a large sum of money to purchase a proper license for FFTW.

(Personally, I ended up writing my own 2D & real-data features using NEON & multi-threading on top of libav's 1D FFT, but it was a lot of effort since I wasn't an FFT expert!)

I just benchmarked libav's NEON enabled FFT against FFTS, and FFTS was the fastest, by at least a factor of 2 in most cases. But libav was a bit faster than FFTW. — Anthony Blake, Nov 18 '12 at 22:59
Did the FFT benchmarking include results from Accelerate framework ? — kiranpradeep, Apr 10 '15 at 13:58
Is there any information on porting FFTW to an embedded MCU like Arm Cortex M4? — djsg, May 29 '21 at 15:28

score 1 · Answer 3 · answered Nov 21 '12 at 23:44

Try Cricket FFT as well. It also has NEON optimizations and a very permissive license (zlib).

Mārtiņš Možeiko
