
The following code has been adapted from here to perform a single 1D transformation using cufftPlan1d. Ultimately I want to perform a batched in-place R2C transformation, but the code below performs a single transformation using separate input and output arrays.

How can I adapt this code to perform the transformation in place, thereby reducing the amount of memory allocated on the device?

Thanks
CUDA 6.5. Note: I'm running the code from a mexFunction in MATLAB 2015a.

Code:

#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>
#define DATASIZE 8
#define BATCH 1
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool  abort=true)
{
   if (code != cudaSuccess) 
   {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);            
        if (abort) exit(code);
   }
}

int main(int argc, char **argv)
{   

// --- Host side input data allocation and initialization
cufftReal *hostInputData = (cufftReal*)malloc(DATASIZE*sizeof(cufftReal));
for (int j=0; j<DATASIZE; j++) hostInputData[j] = (cufftReal)(j + 1);

// --- Device side input data allocation and initialization
cufftReal *deviceInputData; 
gpuErrchk(cudaMalloc((void**)&deviceInputData, DATASIZE * sizeof(cufftReal)));
gpuErrchk(cudaMemcpy(deviceInputData, hostInputData, DATASIZE * sizeof(cufftReal), cudaMemcpyHostToDevice));

// --- Host side output data allocation
cufftComplex *hostOutputData = (cufftComplex*)malloc((DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex));

// --- Device side output data allocation
cufftComplex *deviceOutputData;
gpuErrchk(cudaMalloc((void**)&deviceOutputData, (DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex)));

cufftResult cufftStatus;
cufftHandle handle;

// --- cuFFT calls return a cufftResult, so compare against CUFFT_SUCCESS rather than cudaSuccess
cufftStatus = cufftPlan1d(&handle, DATASIZE, CUFFT_R2C, BATCH);
if (cufftStatus != CUFFT_SUCCESS) { printf("cufftPlan1d failed!\n"); }

cufftStatus = cufftExecR2C(handle, deviceInputData, deviceOutputData);
if (cufftStatus != CUFFT_SUCCESS) { printf("cufftExecR2C failed!\n"); }

// --- Device->Host copy of the results
gpuErrchk(cudaMemcpy(hostOutputData, deviceOutputData, (DATASIZE / 2 + 1) * sizeof(cufftComplex), cudaMemcpyDeviceToHost));

for (int j=0; j<(DATASIZE / 2 + 1); j++)
        printf("%i %f %f\n", j, hostOutputData[j].x, hostOutputData[j].y);

cufftDestroy(handle);
gpuErrchk(cudaFree(deviceOutputData));
gpuErrchk(cudaFree(deviceInputData));
free(hostOutputData);
free(hostInputData);

}
AlexS
  • what error do you get? compile time or runtime? which CUDA version are you using? – m.s. Mar 24 '15 at 19:17
  • CUDA 6.5. I've updated the post to reflect this. I haven't got as far as errors; I can't see how to do it in principle yet. How does one populate a cufftComplex so that (cufftReal*)data will work? – AlexS Mar 24 '15 at 19:25
  • @m.s. I've updated the question to more rigorously address your comment. – AlexS Mar 25 '15 at 16:44
  • can you provide a compilable, self-contained example (see http://sscce.org/) without any MATLAB dependencies? Please add a main function containing your example data as well as the kernel launch. – m.s. Mar 25 '15 at 16:53
  • @m.s. There is no need for a kernel launch function; the transformation is performed by cufftExecR2C, which is a built-in function. The only MATLAB dependency is the mexPrintf function, which can be readily exchanged with the printf function which is commented out above. I changed the function to main(). – AlexS Mar 25 '15 at 16:57

2 Answers


The solution has already been given in another answer: https://stackoverflow.com/a/19208070/678093

For your example, this means:

Allocate input as cufftComplex:

cufftComplex *deviceInputData;
gpuErrchk(cudaMalloc((void**)&deviceInputData, DATASIZE * sizeof(cufftComplex)));
gpuErrchk(cudaMemcpy(deviceInputData, hostInputData, DATASIZE * sizeof(cufftReal), cudaMemcpyHostToDevice));

In-place transformation:

cufftStatus = cufftExecR2C(handle,  (cufftReal *)deviceInputData, deviceInputData);
gpuErrchk(cudaMemcpy(hostOutputData, deviceInputData, (DATASIZE / 2 + 1) * sizeof(cufftComplex), cudaMemcpyDeviceToHost));

By the way, MATLAB also contains a GPU-accelerated version of fft(); maybe this could be useful for you as well: http://de.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html#btjw5gk

m.s.
  • Thanks, I finally worked it out myself as well, but I did it slightly differently: I start with an array of cufftReal input data, but allocate (DATASIZE/2+1) * sizeof(cufftComplex) bytes to make room for the 2 extra floats produced by the transformation. – AlexS Mar 25 '15 at 18:58

Here is my own complete solution, which starts with cufftReal instead:

void process(double *x, double *y, size_t n){
// --- Host side input data allocation and initialization
cufftReal *hostInputData = (cufftReal*)malloc(DATASIZE*sizeof(cufftReal));
for (int j=0; j<DATASIZE; j++) hostInputData[j] = (cufftReal)x[j];

// --- Device side input data allocation and initialization
cufftReal *deviceData; 
gpuErrchk(cudaMalloc((void**)&deviceData, (DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex)));
gpuErrchk(cudaMemcpy(deviceData, hostInputData, DATASIZE * sizeof(cufftReal), cudaMemcpyHostToDevice));

// --- Host side output data allocation
cufftComplex *hostOutputData = (cufftComplex*)malloc((DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex));

cufftResult cufftStatus;
cufftHandle handle;

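// --- cuFFT calls return a cufftResult, so compare against CUFFT_SUCCESS rather than cudaSuccess
cufftStatus = cufftPlan1d(&handle, DATASIZE, CUFFT_R2C, BATCH);
if (cufftStatus != CUFFT_SUCCESS) { mexPrintf("cufftPlan1d failed!"); }

// --- In-place transform: the same device buffer is passed as input and output
cufftStatus = cufftExecR2C(handle, deviceData, (cufftComplex*)deviceData);
if (cufftStatus != CUFFT_SUCCESS) { mexPrintf("cufftExecR2C failed!"); }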

// --- Device->Host copy of the results
gpuErrchk(cudaMemcpy(hostOutputData, deviceData, (DATASIZE / 2 + 1) * sizeof(cufftComplex), cudaMemcpyDeviceToHost));

for (int j=0; j<(DATASIZE / 2 + 1); j++)
        mexPrintf("%i %f %f\n", j, hostOutputData[j].x, hostOutputData[j].y);

cufftDestroy(handle);
gpuErrchk(cudaFree(deviceData));
free(hostOutputData);
free(hostInputData);
}
AlexS