I'm trying to improve the performance of my code by overlapping asynchronous memory transfers with GPU computation.
Previously, my code created a single FFT plan and then reused it many times. In that situation the time spent creating the CUDA FFT plan is negligible, although according to this earlier post plan creation on its own can be quite expensive.
Now that I'm moving to streams, I create the "same" plan "multiple times" and then set the CUDA FFT stream on each one. According to the answers some of you gave in this other post, this is wasteful. But is there any other way to do it?
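To make the current approach concrete, here is a minimal sketch of what I mean. The sizes (`N`, `numPulses`), the placeholder input data, and the single-precision complex-to-complex transform are assumptions for illustration, not my real acquisition code.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1024;       // assumed samples per pulse
    const int numPulses = 4;  // assumed number of pulses, for illustration only
    const size_t pulseBytes = sizeof(cufftComplex) * N;

    // Pinned host memory so cudaMemcpyAsync can overlap with computation.
    cufftComplex* hostData = nullptr;
    cufftComplex* devData = nullptr;
    cudaMallocHost((void**)&hostData, pulseBytes * numPulses);
    cudaMalloc((void**)&devData, pulseBytes * numPulses);
    for (int i = 0; i < N * numPulses; ++i) {
        hostData[i].x = 1.0f;  // placeholder samples
        hostData[i].y = 0.0f;
    }

    std::vector<cudaStream_t> streams(numPulses);
    std::vector<cufftHandle> plans(numPulses);

    for (int p = 0; p < numPulses; ++p) {
        cufftComplex* h = hostData + (size_t)p * N;
        cufftComplex* d = devData + (size_t)p * N;

        cudaStreamCreate(&streams[p]);
        // The "same" plan, created again for every pulse.
        if (cufftPlan1d(&plans[p], N, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
            std::fprintf(stderr, "plan creation failed\n");
            return 1;
        }
        cufftSetStream(plans[p], streams[p]);

        // Copy in, transform in place, copy out, all queued on this pulse's stream.
        cudaMemcpyAsync(d, h, pulseBytes, cudaMemcpyHostToDevice, streams[p]);
        cufftExecC2C(plans[p], d, d, CUFFT_FORWARD);
        cudaMemcpyAsync(h, d, pulseBytes, cudaMemcpyDeviceToHost, streams[p]);
    }

    // Wait for each pulse to finish, then release its plan and stream.
    for (int p = 0; p < numPulses; ++p) {
        cudaStreamSynchronize(streams[p]);
        cufftDestroy(plans[p]);
        cudaStreamDestroy(streams[p]);
    }

    std::printf("bin 0 of pulse 0: %f\n", hostData[0].x);
    cudaFree(devData);
    cudaFreeHost(hostData);
    return 0;
}
```

The part that worries me is the `cufftPlan1d` call inside the per-pulse loop.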
NOTE: I'm acquiring the data in real time, so launching a "batch" CUDA FFT is out of the question. Instead, I create and launch a new CUDA stream each time a complete pulse transmission arrives.
NOTE 2: I was also considering using a "pool" of "CUDA Streams/FFT Plans" instead, but I don't think that would be an elegant or sensible solution. Any thoughts?
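For reference, the pool idea would look roughly like the sketch below. This is an untuned illustration under assumptions: the `Slot` struct, `poolSize`, `numPulses`, `N`, and the placeholder input data are all hypothetical names and values, and each slot gets its own buffers so that pulses in different streams never share memory.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <cstdio>
#include <vector>

// Hypothetical pool slot: a stream, a plan bound to it, and private buffers.
struct Slot {
    cudaStream_t stream;
    cufftHandle plan;
    cufftComplex* host;  // pinned host buffer
    cufftComplex* dev;   // device buffer
};

int main() {
    const int N = 1024;        // assumed samples per pulse
    const int poolSize = 3;    // assumed number of stream/plan pairs
    const int numPulses = 10;  // assumed number of pulses, for illustration only
    const size_t bytes = sizeof(cufftComplex) * N;

    // Create every stream and plan exactly once, up front.
    std::vector<Slot> pool(poolSize);
    for (Slot& s : pool) {
        cudaStreamCreate(&s.stream);
        cudaMallocHost((void**)&s.host, bytes);
        cudaMalloc((void**)&s.dev, bytes);
        if (cufftPlan1d(&s.plan, N, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
            std::fprintf(stderr, "plan creation failed\n");
            return 1;
        }
        cufftSetStream(s.plan, s.stream);
    }

    for (int p = 0; p < numPulses; ++p) {
        Slot& s = pool[p % poolSize];  // round-robin slot selection

        // Make sure the slot's previous pulse has finished before reusing its buffers.
        // A real application would consume the previous result in s.host here.
        cudaStreamSynchronize(s.stream);

        for (int i = 0; i < N; ++i) {  // placeholder for a newly acquired pulse
            s.host[i].x = (float)p;
            s.host[i].y = 0.0f;
        }

        cudaMemcpyAsync(s.dev, s.host, bytes, cudaMemcpyHostToDevice, s.stream);
        cufftExecC2C(s.plan, s.dev, s.dev, CUFFT_FORWARD);
        cudaMemcpyAsync(s.host, s.dev, bytes, cudaMemcpyDeviceToHost, s.stream);
    }

    // Drain and release the pool.
    for (Slot& s : pool) {
        cudaStreamSynchronize(s.stream);
        cufftDestroy(s.plan);
        cudaStreamDestroy(s.stream);
        cudaFree(s.dev);
        cudaFreeHost(s.host);
    }
    return 0;
}
```

With this layout the plan-creation cost is paid `poolSize` times in total rather than once per pulse, but the host blocks in `cudaStreamSynchronize` whenever pulses arrive faster than the pool can drain, which is part of why I'm not sure it's the right design.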
Otherwise, is there a way to "copy" an "existing" FFT plan before I assign the CUDA stream?
Thanks, guys and gals! Hopefully I'll meet some of you in San Jose. =)
Omar