Sometimes ago I encountered the same problem. FFTW for my dataframe was executed in 14 ms (forward, some calculations, and backward), while straightforward byte (or short) to float array conversion took 12-19 ms. So I've made SSE function to convert bytes to floats (4 elements per cycle), and have got significant speed gain - now conversion is accomplished in 2.2-5 ms.
If you compiler can use autovectorization, try it first.
If not, write simple conversion function with intrinsics.
I've used inline assembler (MOVD, PUNPCKLBW, PUNPCKLWD, CVTDQ2PS, MOVAPS command sequence).
procedure BytesToSingles(Src, Dst: Pointer; Count: Integer);
asm
//EAX = Src pointer to byte array
//EDX = Dst pointer to float array !!! 16 byte-aligned !!!
//ECX = Count (multiple of four)
SHR ECX, 2 // 4 elements per cycle
JZ @@Exit
PXOR XMM7, XMM7 // zeros
@@Cycle:
MOVD XMM1, [EAX] // load 4 bytes
PUNPCKLBW XMM1, XMM7 // unpack to words
PUNPCKLWD XMM1, XMM7 // words to int32
CVTDQ2PS XMM0, XMM1 // convert integers to 4 floats
MOVAPS [EDX], XMM0 // store 4 floats to destination array
ADD EAX, 4 // move array pointers
ADD EDX, 16
LOOP @@Cycle
@@Exit:
end;
Note that FFT implementation on 8-bit data will suffer from numerical error issues, as Paul R wrote in comment.