It looks like my application starting to be (i)FFT-bounded, it doing a lot of 2D correlations for rectangles with average sizes about 500x200 (width and height always even). Scenario is as usual - do two FFT (one per field), multiply complex fields, then one iFFT.
So, on CPU (Intel Q6600, with JTransforms libraly) FFT-transformations eating about 70% of time according to profiler, on GPU (GTX670, cuFFT library) - about 50% (so, there is some performance increase on CUDA, but not what I want). I realize, that it's may be the case that GPU not fully saturated (bandwith limited), but from other case - doing calculation in batches will significantly increase application complexity.
Questions:
- what I can do further to decrease time spent on FFT at least several times?
- should I try FFTW library (at this moment I am not sure that it will give significant gain comparing to JTransforms) ?
- are there any specialized hardware which can be plugged to PC for FFT-conversions ?