Despite your rather earnest assertions that cuFFT performs unnecessary data transfers during cufftExecR2C execution, it is trivial to demonstrate that this is, in fact, not the case.
Consider the following example, cobbled together from the code snippets you presented in your question:
#include "cufft.h"
#include "cuda_profiler_api.h"
#include <random>
#include <algorithm>
#include <iterator>
#include <vector>

int main()
{
    const int image_rows = 1600, image_cols = 2048;

    // Fill a host "image" with random values
    std::random_device rnd_device;
    std::mt19937 mersenne_engine {rnd_device()};
    std::uniform_real_distribution<float> dist {0.0, 255.0};
    auto gen = [&dist, &mersenne_engine](){
        return dist(mersenne_engine);
    };
    std::vector<float> image(image_rows * image_cols);
    std::generate(std::begin(image), std::end(image), gen);

    // Copy the image to the device
    float* src;
    cudaMalloc((void**)&src, image_rows * image_cols * sizeof(float));
    cudaMemcpy(src, image.data(), image_rows * image_cols * sizeof(float), cudaMemcpyHostToDevice);

    // Output buffer for the R2C transform (Hermitian symmetry halves the columns)
    cufftComplex* dst;
    cudaMalloc((void**)&dst, image_rows * (image_cols/2+1) * sizeof(cufftComplex));

    cufftHandle plan;
    cufftPlan2d(&plan, image_rows, image_cols, CUFFT_R2C);

    // Profile only the transform itself
    cudaProfilerStart();
    cufftExecR2C(plan, src, dst);
    cudaProfilerStop();

    return 0;
}
I have substituted an array of random values for your image. Now let's compile and profile it:
$ nvcc -std=c++11 -o unecessary unecessary.cu -lcufft
$ nvprof ./unecessary
==10314== NVPROF is profiling process 10314, command: ./unecessary
==10314== Profiling application: ./unecessary
==10314== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 74.39% 2.2136ms 1 2.2136ms 2.2136ms 2.2136ms [CUDA memcpy HtoD]
6.66% 198.30us 1 198.30us 198.30us 198.30us void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
6.50% 193.47us 1 193.47us 193.47us 193.47us void spRadix0025B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=64, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
6.25% 185.98us 1 185.98us 185.98us 185.98us void spVector1024C::kernelMem<unsigned int, float, fftDirection_t=-1, unsigned int=2, unsigned int=5, LUT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_t, unsigned int, float>)
6.20% 184.38us 1 184.38us 184.38us 184.38us __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelMemIjfL9fftAxii_t3EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t
[API calls removed for brevity]
It looks like you are right! A huge memcpy right there in the GPU summary statistics! So let's profile it again, properly this time:
$ nvprof --profile-from-start off ./unecessary
==11674== NVPROF is profiling process 11674, command: ./unecessary
==11674== Profiling application: ./unecessary
==11674== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 25.96% 196.28us 1 196.28us 196.28us 196.28us void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
25.25% 190.91us 1 190.91us 190.91us 190.91us void spRadix0025B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=64, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
24.65% 186.39us 1 186.39us 186.39us 186.39us void spVector1024C::kernelMem<unsigned int, float, fftDirection_t=-1, unsigned int=2, unsigned int=5, LUT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_t, unsigned int, float>)
24.15% 182.59us 1 182.59us 182.59us 182.59us __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelMemIjfL9fftAxii_t3EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t
[Again, API calls removed for brevity]
The memcpy is gone. All the profiler now reports is the four kernel launches associated with the transform execution; there are no memory transfers. The memory transfer reported in the original profiler output is the host-to-device transfer at the beginning of the program and is not associated with the cuFFT call. It was included because nvprof defaults to profiling from the beginning of program execution, which makes the initial cudaProfilerStart call a no-op: profiling was already on. You can read about the correct way to profile code in the toolchain documentation here.
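If you would rather not rely on the profiler at all, you can time the call directly with CUDA events. This is only a sketch, not self-contained code: it assumes the plan, src and dst from the example above, and the comment's expected magnitude is taken from the kernel times in the profiler output, not something you should treat as exact:

```c++
// Time just the transform with CUDA events (no profiler involved).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cufftExecR2C(plan, src, dst);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// The four kernels in the trace above total roughly 0.75 ms; an elapsed
// time in that ballpark (rather than ~3 ms with the 2.2 ms copy added)
// is further evidence there is no hidden transfer inside cufftExecR2C.
std::cout << "cufftExecR2C took " << ms << " ms\n";

cudaEventDestroy(start);
cudaEventDestroy(stop);
```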
In the absence of the promised MCVE, I will offer my own hypothesis: you did not use the profiler correctly, and the transfers you saw occur elsewhere in your code. They are included in the profiler output, but they are completely unrelated to the operation of cuFFT.
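If you want to convince yourself where a reported transfer actually originates, the per-activity timeline is more informative than the summary. Something like:

```shell
$ nvprof --print-gpu-trace ./unecessary
```

prints each GPU activity with its start timestamp, duration, and stream, so you can see that the HtoD copy happens well before the four cuFFT kernels rather than inside cufftExecR2C.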