How does cudaMalloc work?

Question

I am trying to modify the imageDenoising sample in the CUDA SDK. I need to repeat the filter many times in order to measure the running time, but my code doesn't work properly.

//start

__global__ void F1D(TColor *image,int imageW,int imageH, TColor *buffer)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if(iy != 0 && iy < imageH-1  && ix < imageW)
    {
        float4 fresult = get_color(image[imageW * iy + ix]);
        float4 fresult4 = get_color(image[imageW * (iy+1) + ix]);
        float4 fresult5 = get_color(image[imageW * (iy-1) + ix]);

        float4 fresult7;
        fresult7.x = fresult.x*0.5+fresult4.x*.25+fresult5.x*.25;
        fresult7.y = fresult.y*0.5+fresult4.y*.25+fresult5.y*.25;
        fresult7.z = fresult.z*0.5+fresult4.z*.25+fresult5.z*.25;

        buffer[imageW * iy + ix] =
            make_color(fresult7.x,fresult7.y,fresult7.z,0);
    }

    image[imageW * iy + ix] =   buffer[imageW * iy + ix];
    // should use cudaMemcpy here, but it fails
}

//extern

extern "C" void
cuda_F1D(TColor *dst, int imageW, int imageH)
{
    dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
    Copy<<<grid, threads>>>(dst, imageW, imageH);

    size_t size = imageW*imageH*sizeof(TColor);
    TColor *host =(TColor*) malloc(size);
    TColor *dst2;
    //TColor *dst3;
    //TColor *d = new TColor(imageW*imageH*sizeof(TColor));
    dim3 threads2(imageW,1);
    dim3 grid2(iDivUp(imageW, imageW), iDivUp(imageH, 1));

    for(int i = 0;i<1;i++)
    {
        cudaMalloc( (void **)&dst2, size);
        cudaMemcpy(dst2, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);
        //cudaMalloc( (void **)&dst3, imageW*imageH*sizeof(TColor));
        //cudaMemcpy(dst3, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);
        F1D<<<grid2, threads2>>>(dst, imageW, imageH,dst2);
        //cudaMemcpy(dst, dst3, imageW*imageH*sizeof(TColor),cudaMemcpyDeviceToHost);
        cudaFree(dst2);
    }
}

This code runs, but it doesn't synchronise the image array correctly, which leads to many synchronisation problems.

Presumably dst is also cudaMalloc'd somewhere else? One thought for future reference, perhaps you intend to make the "i" loop perform more iterations - you should avoid cudaMalloc and cudaFree inside that loop and do them once only. It's not a good idea to have them on the performance path, they're not super-fast. You could also consider async memcpys if your algorithm allows it. — Tom, May 31 '10 at 08:41
@kitw: click on the **edit** button above to fix typos etc in your question — Paul R, Jun 04 '10 at 14:50
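Tom's suggestion of keeping cudaMalloc and cudaFree out of the loop can be sketched roughly as below. This is only an illustration under assumptions: scaleKernel is a placeholder standing in for the real filter, averageKernelMs is a hypothetical helper name, and the pixels are plain floats rather than TColor. CUDA events are used for the timing, since that is what the loop in the question is meant to capture.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the filter (assumption): scales each element.
__global__ void scaleKernel(float *data, int n, float factor)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: average time per launch in milliseconds,
// or a negative value if allocation or timing fails.
float averageKernelMs(int n, int iterations)
{
    float *d_data = nullptr;
    // Allocate once, before the timed loop.
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until every launch has finished

    float ms = 0.0f;
    const cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);             // free once, after the loop
    return err == cudaSuccess ? ms / iterations : -1.0f;
}
```

Calling averageKernelMs(imageW * imageH, 100) would then time 100 launches without any allocation cost inside the measured region.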

score 6 · Accepted Answer · answered May 30 '10 at 10:11

Your kernel is running asynchronously - you need to wait for it to complete, e.g.

cudaMalloc((void **)&dst2, size);
cudaMemcpy(dst2, dst, imageW * imageH * sizeof(TColor), cudaMemcpyHostToDevice);
F1D<<<grid2, threads2>>>(dst, imageW, imageH, dst2);
cudaThreadSynchronize(); // *** wait for kernel to complete ***
cudaFree(dst2);
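A slightly fuller sketch of the same idea, with error checking added. Note that cudaThreadSynchronize() has since been deprecated in favour of cudaDeviceSynchronize(), which is what this sketch uses. The touch kernel and the launchAndWait helper are made-up names for illustration only. Checking cudaGetLastError() right after the launch is worthwhile because a launch can fail outright, for instance when the block size exceeds the device limit, and in that case the kernel never runs at all.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial placeholder kernel (assumption): every thread stores the same value.
__global__ void touch(int *flag)
{
    *flag = 1;
}

// Hypothetical helper: launches one block of the given size and reports
// whether both the launch and the execution succeeded.
bool launchAndWait(int threadsPerBlock)
{
    int *d_flag = nullptr;
    if (cudaMalloc(&d_flag, sizeof(int)) != cudaSuccess) return false;

    touch<<<1, threadsPerBlock>>>(d_flag);
    const cudaError_t launchErr = cudaGetLastError();     // invalid launch configurations show up here
    const cudaError_t syncErr = cudaDeviceSynchronize();  // blocks until the kernel has finished

    if (launchErr != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));
    if (syncErr != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(syncErr));

    cudaFree(d_flag);
    return launchErr == cudaSuccess && syncErr == cudaSuccess;
}
```

The launch in the question uses imageW threads per block, so if the image is wider than the device's maximum block size, a check like this would be expected to report the failure instead of silently leaving the image unchanged.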

score 0 · Answer 2 · answered Jun 04 '10 at 14:46

I already answered this for you when you posted the same question previously - you need to wait for a kernel to complete before running it again - add:

cudaThreadSynchronize(); // *** wait for kernel to complete ***

after the kernel call.

Paul R

Yes, thanks for your answer, but cudaThreadSynchronize() does not help. I tried it before and somehow the result is still not correct. – kitw Jun 04 '10 at 14:52
Results: http://www.mypicx.com/uploadimg/30130182_06042010_1.jpg The left image is after applying the filter once, the middle one is after this program loops 100 times, and the right one is what I expected to get after 100 times. – kitw Jun 04 '10 at 15:10
@kitw: you still don't have `cudaThreadSynchronize()` in the above code though - why would you leave this out ? The code as it is can never work in a loop without this. – Paul R Jun 05 '10 at 08:57

score 0 · Answer 3 · answered Jun 05 '10 at 03:19

The statement

image[imageW * iy + ix] =   buffer[imageW * iy + ix];

is causing the problem. You are overwriting your input image in the kernel. So depending on thread execution order, you would be further blurring parts of the image.

Also, I don't see the purpose of

cudaMemcpy(dst2, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);

dst looks to be device memory, since you access it in the CUDA kernel, so cudaMemcpyHostToDevice is the wrong direction for that copy.
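One common way around the overwrite problem is to keep two device buffers and swap them between passes, so that each pass reads only from one and writes only to the other. The following is a minimal sketch of that idea, not the SDK's code: it assumes plain float pixels instead of TColor, and blurVertical and blurRepeated are hypothetical names. The weights are the same 0.5 / 0.25 / 0.25 used in the question.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Vertical 3-tap blur: reads only from `in`, writes only to `out`,
// so the result does not depend on thread execution order.
__global__ void blurVertical(const float *in, float *out, int w, int h)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= w || iy >= h) return;      // guard every memory access

    const int idx = w * iy + ix;
    if (iy == 0 || iy == h - 1)
        out[idx] = in[idx];              // border rows are copied unchanged
    else
        out[idx] = 0.5f * in[idx] + 0.25f * in[idx + w] + 0.25f * in[idx - w];
}

// Hypothetical helper: applies the blur `iterations` times to a host image.
std::vector<float> blurRepeated(const std::vector<float> &host, int w, int h, int iterations)
{
    const size_t bytes = host.size() * sizeof(float);
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, bytes);               // both buffers allocated once
    cudaMalloc(&b, bytes);
    cudaMemcpy(a, host.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 threads(16, 16);
    const dim3 grid((w + 15) / 16, (h + 15) / 16);
    for (int i = 0; i < iterations; ++i) {
        blurVertical<<<grid, threads>>>(a, b, w, h);
        std::swap(a, b);                 // the output becomes the next input
    }

    std::vector<float> result(host.size());
    // A blocking copy on the default stream waits for the kernels above.
    cudaMemcpy(result.data(), a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return result;
}
```

Launches on the same stream run in order, so no explicit synchronisation is needed between passes here; only the pointers are swapped, and no pixel data is copied between the two buffers.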
