When using pinned memory in ArrayFire I get slow performance.
I've tried various methods of creating pinned memory and creating arrays from it, e.g. cudaMallocHost. Using cudaMallocHost with cudaMemcpy was pretty fast (several hundred usec.), but then creating/initializing the ArrayFire array was really slow (~2-3 sec.). Finally I came up with the following method. The allocation takes ~2-3 sec., but it can be moved elsewhere. Initializing the array with the host data is satisfactory (100-200 usec.), but now the operations (an FFT in this case) are excruciatingly slow: ~400 msec. I should add that the input signal is variable in size, but for the timing I've used 64K samples (complex doubles). Also, I'm not providing my timing function for brevity, but it isn't the problem: I've timed using other methods and the results are consistent.
// Use the Frequency-Smoothing method to calculate the full
// Spectral Correlation Density
// currently the whole function takes ~ 2555 msec. w/ signal 64K samples
// and window_length = 400 (currently not implemented)
void exhaustive_fsm(std::vector<std::complex<double>> signal, uint16_t window_length) {
    // Allocate pinned memory (eventually move outside function)
    // 2192 ms.
    af::af_cdouble* device_ptr = af::pinned<af::af_cdouble>(signal.size());
    // Init arrayfire array (eventually move outside function)
    // 188 us.
    af::array s(signal.size(), device_ptr, afDevice);
    // Copy to device
    // 289 us.
    s.write((af::af_cdouble*) signal.data(), signal.size() * sizeof(std::complex<double>), afHost);
    // FFT
    // 351 ms. equivalent to:
    // af::array fft = af::fft(s, signal.size());
    af::array fft = zrp::timeit(&af::fft, s, signal.size());
    fft.eval();
    // Convolution
    // Copy result to host
    // free memory (eventually move outside function)
    // 0 ms.
    af::freePinned((void*) s.device<af::af_cdouble>());
    // Return result
}
As I said above, the FFT is taking ~400 msec. This function using Armadillo takes ~110 msec. including the convolution, and the FFT using FFTW takes about 5 msec. Also, on my machine the ArrayFire FFT example gives the following results (modified to use c64):
A = randu(1, N, c64);
Benchmark 1-by-N CX fft
1 x 128: time: 29 us.
1 x 256: time: 31 us.
1 x 512: time: 33 us.
1 x 1024: time: 41 us.
1 x 2048: time: 53 us.
1 x 4096: time: 75 us.
1 x 8192: time: 109 us.
1 x 16384: time: 179 us.
1 x 32768: time: 328 us.
1 x 65536: time: 626 us.
1 x 131072: time: 1227 us.
1 x 262144: time: 2423 us.
1 x 524288: time: 4813 us.
1 x 1048576: time: 9590 us.
So the only difference I can see is the use of pinned memory. Any idea where I'm going wrong? Thanks.
EDIT
I noticed that when running the AF FFT example there is a significant delay before the first time is printed (even though the reported time doesn't include this delay). So I decided to make a class and move all of the allocations/deallocations into the ctor/dtor. Out of curiosity I also put an FFT in the ctor, because I had also noticed that if I ran a second FFT it took ~600 usec., consistent with my benchmarks. Sure enough, running a "preliminary" FFT seems to "initialize" something, and subsequent FFTs run much faster. There has to be a better way; I must be missing something.
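To make the first-call effect easier to reproduce, here is a minimal sketch of a timing helper that runs the work once untimed and only then measures the following calls. It uses only the standard library; `time_after_warmup` is a hypothetical name and is neither my actual timing function nor part of ArrayFire. For asynchronous GPU work, the callable would have to block until the device is finished before returning (I assume a call such as `af::sync()` would serve that purpose), otherwise only the launch is measured.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, for illustration only (not part of ArrayFire).
// Calls `work` once without timing it, then returns the wall-clock
// duration in microseconds of each of the next `runs` calls.
inline std::vector<double> time_after_warmup(const std::function<void()>& work,
                                             std::size_t runs) {
    using clock = std::chrono::steady_clock;

    // Warm-up call: absorbs any one-time setup cost so that it does not
    // show up in the measurements below.
    work();

    std::vector<double> micros;
    micros.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = clock::now();
        // Assumption: `work` blocks until its result is ready. For GPU
        // code the callable itself must synchronize before returning.
        work();
        const auto stop = clock::now();
        micros.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return micros;
}
```

Passing a lambda that performs the FFT and forces evaluation, with `runs` set to something like 10, would give one timing per call after the warm-up. Comparing those numbers with a single cold call shows how much of the ~400 msec. is one-time setup.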