
I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications, table #12) that the amount of local memory per thread is 512 KB for my GPU (GTX 580, compute capability 2.0).

I tried unsuccessfully to check this limit on Linux with CUDA 6.5.

Here is the code I used (its only purpose is to test the local memory limit; it doesn't do any useful computation):

#include <iostream>
#include <stdio.h>

#define MEMSIZE 65000  // 65000 -> out of memory, 60000 -> ok

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=false)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if( abort )
            exit(code);
    }
}

inline void gpuCheckKernelExecutionError( const char *file, int line)
{
    gpuAssert( cudaPeekAtLastError(), file, line);
    gpuAssert( cudaDeviceSynchronize(), file, line);    
}


__global__ void kernel_test_private(char *output)
{
    int c = blockIdx.x*blockDim.x + threadIdx.x; // absolute col
    int r = blockIdx.y*blockDim.y + threadIdx.y; // absolute row

    char tmp[MEMSIZE];
    for( int i = 0; i < MEMSIZE; i++)
        tmp[i] = 4*r + c; // dummy computation in local mem
    for( int i = 0; i < MEMSIZE; i++)
        output[i] = tmp[i];
}

int main( void)
{
    printf( "MEMSIZE=%d bytes.\n", MEMSIZE);

    // allocate memory
    char output[MEMSIZE];
    char *gpuOutput;
    cudaMalloc( (void**) &gpuOutput, MEMSIZE);

    // run kernel
    dim3 dimBlock( 1, 1);
    dim3 dimGrid( 1, 1);
    kernel_test_private<<<dimGrid, dimBlock>>>(gpuOutput);
    gpuCheckKernelExecutionError( __FILE__, __LINE__);

    // transfer data from GPU memory to CPU memory
    cudaMemcpy( output, gpuOutput, MEMSIZE, cudaMemcpyDeviceToHost);

    // release resources
    cudaFree(gpuOutput);
    cudaDeviceReset();

    return 0;
}

And the compilation command line:

nvcc -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu

The compilation succeeds and reports:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z19kernel_test_privatePc' for 'sm_20'
ptxas info    : Function properties for _Z19kernel_test_privatePc
    65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 40 bytes cmem[0]

I got an "out of memory" error at runtime on the GTX 580 when I reached 65000 bytes per thread. Here is the exact output of the program in the console:

MEMSIZE=65000 bytes.
GPUassert: out of memory cuda_test_private_memory.cu 48

I also did a test with a GTX 770 GPU (on Linux with CUDA 6.5). It ran without error for MEMSIZE=200000, but the "out of memory" error occurred at runtime for MEMSIZE=250000.

How can this behavior be explained? Am I doing something wrong?

  • Which CUDA version are you using? Is this Linux or Windows? Are you getting the "out of memory" error when you **compile** the code or when you **run** the code? (paste exact error text into question) What is the command line you are using to compile the code? My guess is that you are compiling for a pre-cc2.0 architecture. If I compile this code for a cc1.1 architecture, I get the "out of memory" error when compiling, because cc1.x devices have an even smaller limit on local memory per thread (16KB). If I compile for a cc2.0 architecture, your code compiles and runs normally for me. – Robert Crovella Mar 02 '15 at 14:54
  • Your problem may also be arising from this line of code: `char output[MEMSIZE];` This (host code) creates a stack-based allocation, and these types of allocations may be limited depending on the platform. Pasting the exact error text into the question would help. (You can edit your own question.) – Robert Crovella Mar 02 '15 at 15:00
  • @RobertCrovella Thank you for your interest in my question. I have edited my question to add the missing information. The exact error text as reported by cudaGetErrorString() at runtime is "out of memory". – devel484 Mar 03 '15 at 09:14
  • This is a really great Q&A and almost certainly a common issue for CUDA programmers (including myself). I hope it gets more visibility! – interestedparty333 Jun 18 '19 at 16:39
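For reference, compiling explicitly for a cc2.0 target, as suggested in the first comment above, only requires adding an architecture flag to the command line shown in the question (this is a sketch of one possible invocation, not the exact command used by the asker):

nvcc -arch=sm_20 -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu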

1 Answer


It seems you are running not into a local memory limitation, but into a stack size limitation:

ptxas info : Function properties for _Z19kernel_test_privatePc

65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

The variable that you had intended to be local is on the (GPU thread) stack, in this case.

Based on the information provided by @njuffa here, the available stack size limit is the lesser of:

  1. The maximum local memory size (512KB for cc2.x and higher)
  2. GPU memory/(#of SMs)/(max threads per SM)

Clearly, the first limit is not the issue. I assume you have a "standard" GTX 580, which has 1.5 GB of memory and 16 SMs. A cc2.x device has a maximum of 1536 resident threads per multiprocessor. This means we have 1536 MB / 16 / 1536 = 1 MB / 16 = 65536 bytes of stack per thread. There is some overhead and other memory usage that subtracts from the total available memory, so the actual stack size limit is somewhat below 65536 bytes, apparently somewhere between 60000 and 65000 in your case.
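As a minimal sketch, the same arithmetic can be reproduced programmatically from the device properties (assuming a single device at index 0, and omitting error checking for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    // Query the properties of device 0 (assumed to be the GPU under test).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Upper bound on the per-thread stack, per the formula above:
    // total GPU memory / number of SMs / max resident threads per SM.
    size_t bound = prop.totalGlobalMem
                 / prop.multiProcessorCount
                 / prop.maxThreadsPerMultiProcessor;

    printf("device                         : %s\n", prop.name);
    printf("total global memory            : %zu bytes\n", (size_t)prop.totalGlobalMem);
    printf("multiprocessor count           : %d\n", prop.multiProcessorCount);
    printf("max threads per multiprocessor : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("estimated stack upper bound    : %zu bytes per thread\n", bound);

    return 0;
}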

I suspect a similar calculation on your GTX770 would yield a similar result, i.e. a maximum stack size between 200000 and 250000.

  • Thank you for your explanation. I did the same calculation for the GTX 770 (4 GB RAM, 8 SMs, max 2048 threads per SM) and got a stack size of 262144 bytes. This is consistent with the limit I encountered on the GTX 770, between 200000 bytes and 250000 bytes (more precisely, between 242000 bytes and 245000 bytes). I also replaced the static allocation in the kernel with a dynamic allocation using malloc(): I was then able to allocate 512 KB of local memory (and even more, up to 7 MB!). – devel484 Mar 04 '15 at 09:45
  • If you want more memory than that from in-kernel `malloc`, review [the documentation](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations) to raise the heap memory allocation limit. – Robert Crovella Mar 04 '15 at 14:16
  • Ok. I thought that the in-kernel call to `malloc` would allocate **local** memory. Now I understand that the in-kernel `malloc` allocates from the global memory heap, which has a default size of 8 MB. – devel484 Mar 05 '15 at 13:09
  • Since the in-kernel `malloc` cannot be used to allocate local memory, the only option seems to be a static allocation in the kernel, which is limited by the thread stack (65 KB in my GTX 580 case). So what does the 512 KB "amount of local memory per thread" reported in the [documentation](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications) apply to? – devel484 Mar 05 '15 at 13:34
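Following up on the in-kernel `malloc` discussion in the comments above, a minimal sketch of raising the device heap limit before launching a kernel that allocates from it might look like the following (the 64 MB figure and the kernel itself are illustrative assumptions, not taken from the thread; compile with -arch=sm_20 or higher, since in-kernel malloc and printf require cc2.0+):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: allocates from the device (global memory) heap, not from local memory.
__global__ void kernel_malloc_test(size_t nbytes)
{
    char *buf = (char *)malloc(nbytes);
    if (buf == NULL)
    {
        printf("in-kernel malloc of %llu bytes failed\n", (unsigned long long)nbytes);
        return;
    }
    buf[0] = 1;   // touch the allocation
    free(buf);
}

int main(void)
{
    // Raise the device heap from its 8 MB default to 64 MB; this must be done
    // before launching any kernel that uses in-kernel malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    kernel_malloc_test<<<1, 1>>>(7 * 1024 * 1024);   // e.g. the 7 MB mentioned in the comments
    cudaDeviceSynchronize();

    return 0;
}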