I'm developing C/C++ software for an embedded Linux system based on Atmel's AT91SAM9G20 processor. I need to compute FFTs quickly in a Linux userspace program, using fixed-point (or perhaps floating-point) math. I understand that an assembler implementation might be the way to go here, and that an additional -mcpu switch might be required when compiling with gcc. What is the best way to proceed with this implementation, and are there any good book references or optimized FOSS libraries available?
Some of the algorithms I have to implement apply small FFTs (e.g. 1024 points) many times, and I wonder whether a library such as kissfft would work just as well. I'm also interested in long FFT lengths, so FFTW, as suggested in an answer below, would work well too.
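For illustration, here is a minimal sketch of what a fixed-point FFT looks like in plain C. It is not taken from kissfft or FFTW. It assumes Q15 samples (16-bit, scaled by 2^15), a hard-coded 8-point size to keep the twiddle table short, and a scale-by-1/2 in every stage so the result is the DFT divided by N. The names `cq15`, `q15_mul`, `FFT_N` and `fft8_q15` are made up for this example, and there is no saturation, so inputs should stay well below full scale.

```c
#include <stdint.h>
#include <assert.h>

#define FFT_N 8                      /* hypothetical fixed size for the sketch */

typedef struct { int16_t re, im; } cq15;   /* complex Q15 sample (assumed type) */

/* Twiddles W_8^k = exp(-2*pi*i*k/8) for k = 0..3, rounded to Q15. */
static const cq15 twiddle[FFT_N / 2] = {
    { 32767, 0 }, { 23170, -23170 }, { 0, -32767 }, { -23170, -23170 }
};

/* Q15 multiply with rounding. Relies on gcc's arithmetic right shift
   for negative values; only integer instructions are needed. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + (1 << 14)) >> 15);
}

/* In-place radix-2 decimation-in-time FFT; output is DFT(x) / FFT_N. */
static void fft8_q15(cq15 x[FFT_N])
{
    /* Bit-reversal permutation. */
    for (unsigned i = 1, j = 0; i < FFT_N; i++) {
        unsigned bit = FFT_N >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            cq15 tmp = x[i];
            x[i] = x[j];
            x[j] = tmp;
        }
    }

    /* log2(FFT_N) butterfly stages, each halving the result. */
    for (unsigned len = 2; len <= FFT_N; len <<= 1) {
        unsigned half = len / 2;
        unsigned step = FFT_N / len;           /* stride into twiddle[] */
        for (unsigned i = 0; i < FFT_N; i += len) {
            for (unsigned k = 0; k < half; k++) {
                cq15 w = twiddle[k * step];
                cq15 a = x[i + k];
                cq15 b = x[i + k + half];
                /* t = w * b (complex multiply in Q15) */
                int32_t tr = q15_mul(w.re, b.re) - q15_mul(w.im, b.im);
                int32_t ti = q15_mul(w.re, b.im) + q15_mul(w.im, b.re);
                x[i + k].re        = (int16_t)((a.re + tr) >> 1);
                x[i + k].im        = (int16_t)((a.im + ti) >> 1);
                x[i + k + half].re = (int16_t)((a.re - tr) >> 1);
                x[i + k + half].im = (int16_t)((a.im - ti) >> 1);
            }
        }
    }
}

/* Usage check: an impulse of 0.5 (16384 in Q15) has a flat spectrum,
   so every bin should come out as 16384 / 8 = 2048. */
static void fft8_q15_selfcheck(void)
{
    cq15 x[FFT_N] = { { 16384, 0 } };
    fft8_q15(x);
    for (int i = 0; i < FFT_N; i++)
        assert(x[i].re == 2048 && x[i].im == 0);
}
```

A 1024-point version would follow the same structure with a 512-entry twiddle table computed once at start-up. A library is likely to do better than this sketch on both speed and precision.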
As a related aside, I am also wondering how integer division is handled in an ARM9 Linux userspace program. If I divide two integers (such as 25 / 4), is the division done using soft floating-point numbers? I also need to implement some heavy number-crunching algorithms, and I am wondering whether fixed-point is a better choice here than floating-point math, and how gcc actually handles each.
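To make the division question concrete, here is a small sketch in C. In C, `25 / 4` on two `int` operands is an integer operation that truncates toward zero. The ARM926EJ-S core in the AT91SAM9G20 has no hardware divide instruction, so gcc typically emits a call to an integer helper routine from libgcc (such as `__aeabi_idiv` on EABI toolchains) rather than going through floating point. The Q15 helpers below (`q15_t`, `q15_from_ratio`, `q15_mul_sat`) are hypothetical names used only to show how fractional values can be handled with integer math alone.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* assumed format: value = raw / 32768 */

/* Plain integer division: truncates toward zero (C99 and later). */
static int int_quotient(int a, int b)  { return a / b; }
static int int_remainder(int a, int b) { return a % b; }

/* Convert the fraction num/den to Q15 using integer math only.
   Assumes 0 <= num < den and num < 65536 so num * 32768 fits in 32 bits. */
static q15_t q15_from_ratio(int32_t num, int32_t den)
{
    return (q15_t)((num * 32768) / den);
}

/* Q15 multiply with rounding and saturation of the one overflow case
   (-1.0 * -1.0). Relies on gcc's arithmetic right shift for negatives. */
static q15_t q15_mul_sat(q15_t a, q15_t b)
{
    int32_t p = ((int32_t)a * b + (1 << 14)) >> 15;
    if (p > INT16_MAX)
        p = INT16_MAX;
    return (q15_t)p;
}
```

For example, `int_quotient(25, 4)` yields 6 with `int_remainder(25, 4)` equal to 1, `q15_from_ratio(1, 4)` yields 8192 (0.25 in Q15), and `q15_mul_sat(16384, 16384)` yields 8192 (0.5 * 0.5 = 0.25). Whether fixed-point beats soft-float for a given algorithm is worth measuring on the target rather than assuming.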