
I've written a simple CUDA function that resizes an image to double its scale. For a 1920*1080 image, this function needs ~20 ms to complete. I've tried several different ways to optimize it, and I found that local memory may be the key reason it is slow.

I have tried three different methods to fetch the image:

  • The GPU module in OpenCV
  • A texture bound to a GpuMat in OpenCV
  • Fetching the GpuMat directly from global memory

None of them brought any improvement.

Then I used nvvp to find out the reason, and the local memory overhead is ~95% in all three cases above.

So I turned to my code to find out how nvcc uses memory. I found that a simple function like this:

__global__ void performDoubleImage(float* outData, size_t step, const int cols, const int rows)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x >= cols)
        return;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (y >= rows)
        return;
    ((float*)((size_t)outData+y*step))[x] = tex2D(texRef, x/2, y/2);
}

needs an 80-byte stack frame (which resides in local memory).
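(For reference, the texture is bound to the GpuMat and the kernel is launched roughly like this; the channel descriptor, block size, and binding calls below are a sketch of my setup rather than the exact code.)

texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;

void doubleImage(const cv::gpu::GpuMat& in, cv::gpu::GpuMat& out)
{
    // Bind the input GpuMat to the 2D texture reference
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(0, texRef, in.data, desc, in.cols, in.rows, in.step);

    // One thread per output pixel
    dim3 block(32, 8);
    dim3 grid((out.cols + block.x - 1) / block.x,
              (out.rows + block.y - 1) / block.y);
    performDoubleImage<<<grid, block>>>(out.ptr<float>(), out.step,
                                        out.cols, out.rows);

    cudaUnbindTexture(texRef);
}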

And another function like this:

__global__ void performFinalDoubleImage(const PtrStepSz<float> in, PtrStepSz<float> out)
{
    out(out.rows-1, out.cols-1) = out(in.rows-1, in.cols-1);
}

also needs an 88-byte stack frame.

The question is: why does my function use so much local memory and so many registers for such a simple task? And why can the OpenCV function do the same job using no local memory at all (tested with nvvp; the local memory load is ZERO)?

My code is compiled in debug mode, and my card is a GT650 (192 SP/SM, 2 SM).

  • Those functions can't possibly be using that much stack, they shouldn't even be using stack at all. What GPU is this? Are you sure you're interpreting the nvvp output correctly? Are you compiling in debug mode or with optimizations disabled (even then...) or anything special in the command line? – user703016 Dec 12 '14 at 15:11
  • My card is a GT650, with 192 SP/SM and 2 SM. I compiled my code in debug mode; maybe the debug mode is the reason? I can't give you the nvcc output because I'm home now. I will add the extra information to my question tomorrow. Thanks! – cs512 Dec 12 '14 at 16:05
  • How are you determining stack frame? I dropped your first kernel into a random cuda program, and when I compile with `-Xptxas -v` I get: `ptxas info : Function properties for _Z18performDoubleImagePfmii 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 9 registers, 56 bytes cmem[0], 1 textures` – Robert Crovella Dec 12 '14 at 16:14
  • 1
    @cs512: Compiling in debug mode is the likely source of the stack frame usage. In order for variables to be trackable at all times they may need to be stored in local memory. – njuffa Dec 12 '14 at 16:16
  • 1
    When I compile the same code and add `-G`, I get: `ptxas info : Function properties for _Z18performDoubleImagePfmii 256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 23 registers, 296 bytes cumulative stack size, 56 bytes cmem[0], 1 textures`, so, yes, debug (`-G`) seems to make a big difference. Since the original motivation for this question seems to be about performance, it's worth stating that it's never a good idea to analyze device code for performance if you compile with `-G` – Robert Crovella Dec 12 '14 at 16:17
  • That's odd. Maybe the reason is that I compile it in debug mode? – cs512 Dec 12 '14 at 16:17
  • @ParkYoung-Bae you suggested the question about debug mode. If you would like to propose an answer I would upvote. – Robert Crovella Dec 12 '14 at 16:24

1 Answer


The two functions you've posted are way too simple to be using that much stack; in fact, they shouldn't be using any stack at all. The most likely reason there is so much spilling is that you are compiling with optimizations disabled (for example, in debug mode).

For reference, Robert Crovella compiled your first kernel in release and in debug mode:

Debug:

ptxas info : Function properties for _Z18performDoubleImagePfmii
    256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 23 registers, 296 bytes cumulative stack size, 56 bytes cmem[0], 1 textures

Release:

ptxas info : Function properties for _Z18performDoubleImagePfmii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 56 bytes cmem[0], 1 textures

Note the difference in stack and register usage. As noted in the comments, when measuring the performance of a program you should always compile at the maximum optimization level; otherwise, the measurements will be meaningless.
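For example (the file names here are just placeholders), you can check the stack frame and spill statistics yourself by passing `-Xptxas -v` to nvcc, and compare a release build against a debug (`-G`) build:

# Release build: device code fully optimized
nvcc -O3 -Xptxas -v -o resize resize.cu

# Debug build: -G disables most device-code optimizations,
# which is what pushes variables into local memory
nvcc -G -Xptxas -v -o resize_dbg resize.cu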
