
I am experimenting with CUDA and observe that data is copied from host to device when I invoke

cufftExecR2C(plan, src, dst);

which I don't understand, since my src pointer is a valid handle to the device memory that I would like to transform. Before cufftExecR2C(...) I initialized the arguments as follows:

float* src;
cudaMalloc((void**)&src, image_rows * image_cols * sizeof(float));
cudaMemcpy(src, image.data(), image_rows * image_cols * sizeof(float), cudaMemcpyHostToDevice);

cufftComplex* dst;
cudaMalloc((void**)&dst, image_rows * (image_cols/2+1) * sizeof(cufftComplex));

cufftHandle plan;
cufftPlan2d(&plan, image_rows, image_cols, CUFFT_R2C);
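
For what it's worth, one can verify that src really is a device pointer with cudaPointerGetAttributes (a minimal sketch; memoryType is the field name as of CUDA 8.0, later toolkits renamed it to type):

cudaPointerAttributes attr;
if (cudaPointerGetAttributes(&attr, src) == cudaSuccess &&
    attr.memoryType == cudaMemoryTypeDevice) {
  // src is ordinary device memory, as expected
}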

Launching the NVIDIA profiler (nvprof), considering only the FFT, I get the following result:

...
cudaProfilerStart();
cufftExecR2C(plan, src, dst);
cudaProfilerStop();
...

[nvprof screenshot showing three host to device memcpy calls during cufftExecR2C]

I would like to avoid the three unnecessary host to device copy calls. I don't see why CUDA performs these additional copies, and especially why host to device, since the data is already in device memory.

The program is executed on a GeForce GT 540M using CUDA 8.0.

Thank you!

  • How do you know the transfers are "unnecessary"? They are probably related to library internals which are required to set up the call. The first one appears slow because it is capturing the context establishment costs of your application. – talonmies Jul 01 '18 at 14:41
  • Unnecessary because the data already resides on the device. The profiling tool shows host to device copies, and I do not see why they are needed; that would mean the data is copied back to the host and then to the device again. – Martin1988a Jul 01 '18 at 18:20
  • Again, you don't know what it is copying, and it is unlikely to be your data; I would wager it is a transfer to library internals. Post a proper Minimal, Complete, and Verifiable example and the accompanying profiler API trace for it, and I believe it will be possible to disprove your hypothesis by inspection. – talonmies Jul 01 '18 at 18:51
  • With a bit of careful thought, you'll come to the conclusion that the `cufft` API (and any functions spawned by it) is not touching your host data referenced by your `image.data()` pointer; positing that it is would eventually lead to absurd conclusions about cuFFT API behavior. Once you come to agreement about that, the premise of this question, that the data copies are "unnecessary", is no longer a foregone conclusion, and you would need to justify why you think they are "unnecessary". I think that would be difficult, unless you actually know what data is being copied. – Robert Crovella Jul 01 '18 at 21:25
  • Ok, I understand that there may be internal copies that are required by the library, and I also see that it is not my host data that is accessed by the library. But I would like to understand why the library requires data transfers from host to device; that is what I was referring to when I said unnecessary (maybe not the right term). I would be OK with device to device, but ... @talonmies I will provide an example tomorrow. Thanks – Martin1988a Jul 03 '18 at 07:25
  • @Martin1988a: What happened to the example you promised? – talonmies Jul 11 '18 at 05:46

1 Answer


Despite your rather earnest assertions regarding cuFFT performing unnecessary data transfers during cufftExecR2C execution, it is trivial to demonstrate that this is, in fact, not the case.

Consider the following example, cobbled together from the code snippets you presented in your question:

#include "cufft.h"
#include "cuda_profiler_api.h"
#include <random>
#include <algorithm>
#include <iterator>
#include <iostream>
#include <functional>

int main()
{
  const int image_rows = 1600, image_cols = 2048;

  std::random_device rnd_device;
  std::mt19937 mersenne_engine {rnd_device()};
  std::uniform_real_distribution<float> dist {0.0, 255.0};

  auto gen = [&dist, &mersenne_engine](){
                 return dist(mersenne_engine);
             };

  std::vector<float> image(image_rows * image_cols);
  std::generate(std::begin(image), std::end(image), gen);

  float* src;
  cudaMalloc((&src),  image_rows * image_cols  * sizeof(float) );
  cudaMemcpy(src, &image[0],  image_rows * image_cols  * sizeof(float)  , cudaMemcpyHostToDevice);
  cufftComplex* dst;
  cudaMalloc((void**)&dst    , image_rows * (image_cols/2+1) * sizeof(cufftComplex) );

  cufftHandle plan;
  cufftPlan2d(&plan, image_rows, image_cols, CUFFT_R2C);

  cudaProfilerStart();
  cufftExecR2C(plan, src, dst);
  cudaProfilerStop();

  return 0;
}

I have substituted an array of random values for your image. Now let's compile and profile it:

$ nvcc -std=c++11 -o unecessary unecessary.cu -lcufft
$ nvprof ./unecessary
==10314== NVPROF is profiling process 10314, command: ./unecessary
==10314== Profiling application: ./unecessary
==10314== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   74.39%  2.2136ms         1  2.2136ms  2.2136ms  2.2136ms  [CUDA memcpy HtoD]
                    6.66%  198.30us         1  198.30us  198.30us  198.30us  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
                    6.50%  193.47us         1  193.47us  193.47us  193.47us  void spRadix0025B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=64, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
                    6.25%  185.98us         1  185.98us  185.98us  185.98us  void spVector1024C::kernelMem<unsigned int, float, fftDirection_t=-1, unsigned int=2, unsigned int=5, LUT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_t, unsigned int, float>)
                    6.20%  184.38us         1  184.38us  184.38us  184.38us  __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelMemIjfL9fftAxii_t3EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t

[API calls removed for brevity]

It looks like you are right! A huge memcpy right there in the GPU summary statistics!

So let's profile it again properly:

$ nvprof --profile-from-start off ./unecessary
==11674== NVPROF is profiling process 11674, command: ./unecessary
==11674== Profiling application: ./unecessary
==11674== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   25.96%  196.28us         1  196.28us  196.28us  196.28us  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
                   25.25%  190.91us         1  190.91us  190.91us  190.91us  void spRadix0025B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=64, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
                   24.65%  186.39us         1  186.39us  186.39us  186.39us  void spVector1024C::kernelMem<unsigned int, float, fftDirection_t=-1, unsigned int=2, unsigned int=5, LUT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_t, unsigned int, float>)
                   24.15%  182.59us         1  182.59us  182.59us  182.59us  __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelMemIjfL9fftAxii_t3EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t

[Again, API calls removed for brevity]

The memcpy is gone. All that the profiler reports is four kernel launches associated with the transform execution. No memory transfers. The memory transfer reported in the original profiler output is the host to device transfer at the beginning of the program, and it is not associated with the cuFFT call. It is included because nvprof profiles from the beginning of program execution by default, so the initial cudaProfilerStart call has no effect: profiling was already on. You can read about the correct way to profile code in the toolchain documentation here.
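
For reference, the intended pattern is sketched below (the cudaDeviceSynchronize calls are an addition of mine, to keep pending setup work out of the profiled region and to make sure the asynchronously launched transform kernels are captured):

cudaDeviceSynchronize();        // let allocation, copies and planning finish first
cudaProfilerStart();            // takes effect only with --profile-from-start off
cufftExecR2C(plan, src, dst);   // the transform kernels are launched asynchronously
cudaDeviceSynchronize();        // wait for the kernels before stopping the profiler
cudaProfilerStop();

combined with running the profiler as shown above:

$ nvprof --profile-from-start off ./unecessary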

I will offer my own hypothesis in the absence of the promised MCVE: you didn't use the profiler correctly, and the transfers reported occur elsewhere in your code. They are included in the profiler output, but are completely unrelated to the operation of cuFFT.

talonmies