I've developed C code for a 3-dimensional FFT (MKL interface) to run natively on an Intel MIC platform.
Data elements are double-precision complex, for a complex-to-complex transform. I'm using a padded leading dimension, 64-byte alignment via mkl_malloc(), and radix-2 dimensions for the array. The performance I end up with is around 50 Gflop/s.
I can't find performance listings anywhere for similar types of transforms. Can anyone tell me if this is reasonable (to be satisfied with) on Xeon Phi?