I've written a simple CUDA kernel that resizes an image to double its scale. For a 1920*1080 image, it needs ~20ms to complete. I've tried several ways to optimize it, and I've come to suspect that local memory is the key reason for the slowness.
I have tried three different methods to fetch the image:
- The gpu module in OpenCV
- A texture bound to a GpuMat in OpenCV
- Fetching the GpuMat directly from global memory (sketched below)
None of them brought any noticeable improvement.
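For reference, the third variant (direct global-memory fetch) was essentially the following sketch. The function and parameter names here are just for illustration; I'm assuming both pitches are in bytes and that cols/rows describe the output image:

__global__ void performDoubleImageGlobal(const float* inData, size_t inStep,
                                         float* outData, size_t outStep,
                                         const int cols, const int rows)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x >= cols || y >= rows)
        return;
    // nearest-neighbour 2x upscale: each output pixel reads source pixel (x/2, y/2)
    const float* inRow = (const float*)((const char*)inData + (y / 2) * inStep);
    float* outRow = (float*)((char*)outData + y * outStep);
    outRow[x] = inRow[x / 2];
}

It is launched with an ordinary 2D grid (d_in/d_out are pitched device allocations, e.g. from cudaMallocPitch):

dim3 block(32, 8);
dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
performDoubleImageGlobal<<<grid, block>>>(d_in, inStep, d_out, outStep, cols, rows);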
Then I used nvvp to find out the reason, and the local memory overhead is ~95% in all three cases above.
So I turned to my code to see how nvcc was using memory, and I found that a simple function like this:
// texture reference for the source image, declared at file scope
texture<float, 2, cudaReadModeElementType> texRef;

__global__ void performDoubleImage(float* outData, size_t step, const int cols, const int rows)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x >= cols)
        return;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (y >= rows)
        return;
    // step is the output pitch in bytes; each output pixel samples the source at (x/2, y/2)
    ((float*)((char*)outData + y * step))[x] = tex2D(texRef, x / 2, y / 2);
}
needs an 80-byte stack frame (which lives in local memory).
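These numbers can be seen in the ptxas verbose output (compile with -Xptxas -v); for the first kernel it prints something like the following (the spill counts here are illustrative):

ptxas info    : Function properties for _Z18performDoubleImagePfmii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads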
And another function like this:
// PtrStepSz comes from OpenCV's gpu module
__global__ void performFinalDoubleImage(const PtrStepSz<float> in, PtrStepSz<float> out)
{
    // copy the last pixel of the input to the last pixel of the output
    out(out.rows - 1, out.cols - 1) = in(in.rows - 1, in.cols - 1);
}
also needs an 88-byte stack frame.
The question is: why does my function use so much local memory and so many registers for such a simple task? And why can the function in OpenCV do the same job while using no local memory at all (tested with nvvp; its local memory load is ZERO)?
My code is compiled in debug mode. My card is a GT650 (192 SP/SM, 2 SM).
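For reference, the two build configurations look roughly like this (the file name and architecture flag are just how I've written the example; the GT650 is an sm_30 part):

# debug build: -G disables device-side optimizations
nvcc -G -g -arch=sm_30 -c performDoubleImage.cu

# release build: device optimizations enabled by default
nvcc -O2 -arch=sm_30 -c performDoubleImage.cu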