CUDA NPP - Error on printing output

Question

Following my previous post here: CUDA NPP - unknown error upon GPU error check

I have tried to sum all the pixels in an image using the CUDA NPP library, and with the help of some developers I finally got my code to compile. However, when I try to print the value stored in partialSum by copying it into a double variable (consistent with the NPP guide for CUDA v4.2), I get this error:

Unhandled exception at 0x00fdf7f4 in MedianFilter.exe: 0xC0000005: Access violation reading location 0x40000000.

I've been trying to get rid of it, but I have been unsuccessful so far. Please help! I have been at this small piece of code for about two days now.

Code:

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) getchar();
    }
}

// processing image starts here 

// device_pointer initializations
unsigned char *device_input;
unsigned char *device_output;    

size_t d_ipimgSize = input.step * input.rows;
size_t d_opimgSize = output.step * output.rows;

gpuErrchk( cudaMalloc( (void**) &device_input, d_ipimgSize) );
gpuErrchk( cudaMalloc( (void**) &device_output, d_opimgSize) );

gpuErrchk( cudaMemcpy(device_input, input.data, d_ipimgSize, cudaMemcpyHostToDevice) );


// Median filter the input image here
// .......


// allocate data on the host for comparing the sum of all pixels in image with CUDA implementation

// 1st argument - allocate data for pSrc - copy device_output into this pointer
Npp8u *odata; 
gpuErrchk( cudaMalloc( (void**) &odata, sizeof(Npp8u)*output.rows*output.cols ) );
gpuErrchk( cudaMemcpy(odata, device_output, sizeof(Npp8u)*output.rows*output.cols, cudaMemcpyDeviceToDevice) ); 

// 2nd arg - set step 
int ostep = output.step;  

// 3rd arg - set nppiSize
NppiSize imSize; 
imSize.width = output.cols; 
imSize.height = output.rows;

// 4th arg - set npp8u scratch buffer size
Npp8u *scratch; 
int bytes = 0;
nppiReductionGetBufferHostSize_8u_C1R( imSize, &bytes);

gpuErrchk( cudaMalloc((void **)&scratch, bytes) );

// 5th arg - set npp64f partialSum (64 bit double will be the result)
Npp64f *partialSum; 
gpuErrchk( cudaMalloc( (void**) &partialSum, sizeof(Npp64f) ) );

// args:       Npp8u*, int, NppiSize, Npp8u*, Npp64f*
nppiSum_8u_C1R( odata, ostep, imSize, scratch, partialSum );

double *dev_result;
    dev_result = (double*)malloc(sizeof(double)); // EDIT
gpuErrchk( cudaMemcpy(&dev_result, partialSum, sizeof(double), cudaMemcpyDeviceToHost) );
//int tot = output.rows * output.cols;
printf( "\n Total Sum cuda %f \n",  *dev_result) ;   // <---- access violation here

I'm only guessing here, but I hope this helps you out. From a very superficial point of view, it looks odd that you are mixing GPU memory and RAM variables. Try printing "Hello World". If that works, move on to the next step. If your inline code goes through the GPU pipeline (or uses multi SIMD), then your %s and %d will point to GPU memory instead of RAM, while fprintf will use the kernel (which accesses RAM). — MichaelCMS, Mar 21 '14 at 23:52
You have never allocated any memory for `dev_result` on the host, which is resulting in the cudaMemcpy corrupting the stack and crashing your program. I have voted to close this, I think it is reasonable to expect a modicum of debugging and analysis before resorting to posting an [SO] question. — talonmies, Mar 23 '14 at 12:47
@talonmies I recognized that I didn't `malloc dev_result` before and so I did allocate it. It still throws me the same error. I have tried everything, and yet it does not print out the result. Obviously, I try to debug my answer before posting; otherwise, the entire concept of me learning CUDA NPP will be null and void. — Eagle, Mar 23 '14 at 13:24

score 2 · Accepted Answer · answered Mar 24 '14 at 06:44

The problem here seems to be basic pointer misuse (I say seems because we have incomplete, uncompilable code, so it is hard to say for certain).

This should work:

double *dev_result = (double*)malloc(sizeof(double));
gpuErrchk( cudaMemcpy(dev_result, partialSum, sizeof(double), cudaMemcpyDeviceToHost) );
printf( "\n Total Sum cuda %f \n",  *dev_result);

This should also work:

double dev_result;
gpuErrchk( cudaMemcpy(&dev_result, partialSum, sizeof(double), cudaMemcpyDeviceToHost) );
printf( "\n Total Sum cuda %f \n",  dev_result);

This assumes that everything else in the incomplete code is correct. I leave it as an exercise to the reader to spot the differences between the three variants.
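
For readers who want the difference spelled out, here is a host-only sketch in plain C++. It is not the asker's code: `copy_to_host`, `read_via_heap`, and `read_via_stack` are hypothetical names, and `std::memcpy` stands in for `cudaMemcpy(..., cudaMemcpyDeviceToHost)` so the example runs without a GPU. The point it tries to illustrate is that the destination argument must be the address of the storage that receives the `double`. That is the pointer value itself when the storage came from `malloc`, or `&variable` when the storage is a plain `double`. Passing `&dev_result` when `dev_result` is a `double*` hands over the address of the pointer variable instead, so the copy overwrites the pointer and the later `*dev_result` dereferences whatever bytes landed there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical host-only stand-in for
// cudaMemcpy(dst, src, n, cudaMemcpyDeviceToHost).
static void copy_to_host(void *dst, const void *src, std::size_t n)
{
    std::memcpy(dst, src, n);
}

// Heap variant: 'result' already holds the address of the malloc'ed block,
// so it is passed as-is (no '&').
double read_via_heap(const double *source)
{
    double *result = static_cast<double *>(std::malloc(sizeof(double)));
    assert(result != nullptr);
    copy_to_host(result, source, sizeof(double));
    double value = *result;   // reads the block that was just filled
    std::free(result);
    return value;
}

// Stack variant: 'result' is the double itself, so its address is passed.
double read_via_stack(const double *source)
{
    double result = 0.0;
    copy_to_host(&result, source, sizeof(double));
    return result;
}

// Broken variant from the question (described only, not executed here):
//   double *result = (double*)malloc(sizeof(double));
//   copy_to_host(&result, source, sizeof(double));  // overwrites the pointer
//   *result;                                        // dereferences garbage
```

Both functions return the value stored at `source`. In the real program, `source` would be the device pointer `partialSum` and the copy would be the `cudaMemcpy` call wrapped in `gpuErrchk`.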

Thanks for the down-vote. I deserve it for accessing garbage values by dereferencing `dev_result` when no values were written to it by the `cudaMemcpy` process. I deserve a rap on the head for that; silly me. :| — Eagle, Mar 25 '14 at 23:03
