8-bit FFT for CPU architectures?

Question

I am looking for an FFT engine that can handle 8-bit real to complex transforms (of size 65K). The need for this is to accelerate a real-time signal processing engine. It is currently limited by 8-bit -> FP32 and FP32 -> 8-bit conversions, as well as the actual FFT being memory bandwidth bound (we're using FFTW at the moment).

I thought that the Spiral project (http://spiral.net) might be able to do this, but the only code that seems to be available on their webpage is for single- or double-precision transforms.

Anyone know of any C or C++ libraries that can do this?

What's wrong with converting the input to FP32 and then plugging it into a standard FFT library? — Mysticial, Apr 17 '13 at 17:15
As I said, this is what I am doing at the moment, but it is causing a bottleneck. FFT is a memory-bandwidth-bound problem; doing everything in native 8-bit reduces the memory throughput required by a factor of 4, and it avoids the conversions, which cost time as well. — Maddy Scientist, Apr 17 '13 at 17:44
My bad, I should've read more closely. If you can't find a library to do this for you, you're mostly out of luck unless you're willing to get your hands dirty. — Mysticial, Apr 17 '13 at 17:46
If I understand you correctly, you want to keep all the intermediate data at 8 bits? If so, then that's not going to work. You have log2(N) = 16 butterfly stages to compute and 8 bits just doesn't have the dynamic range for this. Even 24 bit fixed point DSPs would have significant problems with dynamic range and truncation noise with FFTs of this size. Stick with floating point. — Paul R, Apr 17 '13 at 20:14
@PaulR is exactly right. You might as well just generate random 8-bit data; it will be about as accurate and much easier. — Stephen Canon, Apr 20 '13 at 21:42
@PaulR is correct. FPGAs have a similar problem. What the FPGA guys tend to do is add a bit to their data word width for every FFT stage. This minimises the bits needed to safely represent the data throughout. You could try something similar: use 16-bit ints for the first 8 layers of the FFT before switching to 32 bits for the last layers. — bazza, Apr 24 '13 at 06:09

MBo · Answer 1 · 2013-04-18T07:04:19.013

Some time ago I encountered the same problem. FFTW took 14 ms on my data frame (forward transform, some calculations, and backward transform), while a straightforward byte (or short) to float array conversion took 12-19 ms. So I wrote an SSE function to convert bytes to floats (4 elements per loop iteration) and got a significant speed gain: the conversion now takes 2.2-5 ms.

If your compiler can autovectorize, try that first.
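
For reference, a minimal sketch in C of the kind of loop an optimizing compiler can usually vectorize on its own. The function name `bytes_to_floats` is made up for illustration, the input bytes are assumed to be unsigned, and whether vectorization actually happens depends on the compiler and its flags (for example `-O3` with GCC or Clang), so it is worth checking the generated code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: plain scalar conversion of unsigned bytes to floats.
   `restrict` promises the compiler that src and dst do not overlap, which
   makes it easier for the loop to be vectorized. */
void bytes_to_floats(const uint8_t *restrict src, float *restrict dst,
                     size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = (float)src[i];
}
```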

If not, write a simple conversion function with intrinsics.

I used inline assembler (the instruction sequence MOVD, PUNPCKLBW, PUNPCKLWD, CVTDQ2PS, MOVAPS):

procedure BytesToSingles(Src, Dst: Pointer; Count: Integer);
asm
  //EAX = Src pointer to byte array
  //EDX = Dst pointer to float array !!! 16 byte-aligned !!!
  //ECX = Count (multiple of four)
  SHR ECX, 2           // 4 elements per cycle
  JZ @@Exit
  PXOR XMM7, XMM7      // zeros
@@Cycle:
  MOVD XMM1, [EAX]     // load 4 bytes
  PUNPCKLBW XMM1, XMM7 // unpack to words
  PUNPCKLWD XMM1, XMM7 // words to int32
  CVTDQ2PS XMM0, XMM1  // convert integers to 4 floats
  MOVAPS [EDX], XMM0   // store 4 floats to destination array
  ADD EAX, 4           // move array pointers
  ADD EDX, 16
  LOOP @@Cycle
@@Exit:
end;

Note that an FFT implementation on 8-bit data will suffer from numerical error issues, as Paul R wrote in his comment.

score 2 · Answer 2 · edited May 23 '17 at 11:50

You do not want to do all the processing in fixed point. Your data will turn to mush in an FFT of that size. Technically, you could use 32-bit fixed point and keep all your dynamics, but you'd still have to convert the data and it will be slower than using floats (you tagged SSE, so I assume you are on an Intel machine with an FPU). I base my opinions on my work creating kissfft.
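
To put a rough number on the dynamic-range argument, here is a small sketch in C of the one-extra-bit-per-stage rule of thumb mentioned in the comments. The function name `fft_bits_needed` is hypothetical, N is assumed to be a power of two, and the transform is assumed to apply no scaling between stages. It ignores rounding noise from the twiddle-factor multiplications, so treat the result as a lower bound on the word width.

```c
/* Hypothetical helper: word width suggested by the "one extra bit per
   radix-2 stage" rule of thumb for an unscaled FFT.
   Assumes n is a power of two, so the loop counts exactly log2(n) stages. */
int fft_bits_needed(int input_bits, unsigned long n)
{
    int stages = 0;
    while (n > 1) {      /* count log2(n) butterfly stages */
        n >>= 1;
        ++stages;
    }
    return input_bits + stages;
}
```

For 8-bit input and N = 65536 this gives 8 + 16 = 24 bits, so 8-bit intermediates would have to be scaled down at nearly every stage, while a 32-bit word has room to spare.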

Focus instead on speeding up the type conversion. I've not run MBo's assembly code, but it looks like the right approach. I think unrolling might make it faster.

If you are not accustomed to assembly, use SSE2 compiler intrinsics instead. It will be just as fast (assuming a decent compiler) and it will make your code more readable and maintainable. This answer will give you most of what you need.
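
As a rough illustration, the same unpack-and-convert sequence as MBo's assembly might look like this with SSE2 intrinsics in C. This is a sketch under assumptions, not a tuned implementation: the function name `bytes_to_floats_sse2` is made up, the bytes are treated as unsigned (zero-extended, as in the assembly), and `count` is assumed to be a multiple of 16. It handles 16 bytes per iteration, which also gives the effect of unrolling. Unaligned stores are used so the sketch has no alignment requirement; if `dst` is known to be 16-byte aligned, `_mm_store_ps` (the MOVAPS equivalent) can be used instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert unsigned bytes to floats, 16 per iteration.
   Assumption: count is a multiple of 16. */
void bytes_to_floats_sse2(const uint8_t *src, float *dst, size_t count)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < count; i += 16) {
        /* load 16 bytes (unaligned load) */
        __m128i b  = _mm_loadu_si128((const __m128i *)(src + i));
        /* zero-extend bytes to 16-bit words (PUNPCKLBW / PUNPCKHBW) */
        __m128i w0 = _mm_unpacklo_epi8(b, zero);
        __m128i w1 = _mm_unpackhi_epi8(b, zero);
        /* zero-extend words to 32-bit ints, convert to float (CVTDQ2PS) */
        _mm_storeu_ps(dst + i,      _mm_cvtepi32_ps(_mm_unpacklo_epi16(w0, zero)));
        _mm_storeu_ps(dst + i + 4,  _mm_cvtepi32_ps(_mm_unpackhi_epi16(w0, zero)));
        _mm_storeu_ps(dst + i + 8,  _mm_cvtepi32_ps(_mm_unpacklo_epi16(w1, zero)));
        _mm_storeu_ps(dst + i + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w1, zero)));
    }
}
```

Signed 8-bit input would need sign extension instead of the zero unpack, and a scalar tail loop would be needed if the length is not a multiple of 16.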
