
I was testing dynamic allocation inside a kernel, i.e.:

__device__ double *temp;
__global__
void test(){
    temp = new double[125000]; //1MB
}

calling this function 100 times to see if the memory was decreasing:

size_t free, total;
CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6)); 

for(int t=0;t<100;t++){
    test<<<1, 1>>>();
    CUDA_CHECK(cudaDeviceSynchronize());
    fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6));
}
CUDA_CHECK(cudaMemGetInfo(&free, &total));
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6));

and it actually was decreasing.

  1. Note: when trying WITHOUT the kernel call AND the cudaMemGetInfo inside the loop, the available memory was decreasing from 800 Mo to 650 Mo, and I concluded that the output to the console took ~150 Mo. Indeed, when trying the code as written above, the result doesn't change. But that's huge!
  2. I get a decrease in available memory of ~50 Mo after the loop (I don't get any decrease when I comment out the kernel call, fortunately). When I add a delete of temp inside the kernel, it doesn't seem to reduce the amount of wasted memory by much; I still get a decrease of ~30 Mo. Why?
  3. Using a cudaFree(temp) or a cudaDeviceReset() after the loop doesn't help much either. Why? And how can I deallocate that memory?
François Laenen

1 Answer


It really sounds like you need to read this question and answer pair before going much further.

The memory you are allocating with new inside the kernel comes from a static runtime heap which is set up as part of the "lazy" context establishment initiated by the CUDA runtime when your program runs. The first CUDA call which establishes the context also loads the module containing the kernel code and reserves local memory, runtime buffers and the runtime heap for the kernel calls which follow. That is where most of the memory consumption you have observed is coming from. The runtime API contains a call which gives the user control over the size of these allocations.
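That call is cudaDeviceSetLimit with cudaLimitMallocHeapSize. As a minimal sketch (the 64 MB value below is purely illustrative, and the limit must be set before launching any kernel that uses the device heap):

size_t heapsize = 64 * 1024 * 1024;   // illustrative size only
CUDA_CHECK(cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize));

// read the limit back to confirm the new heap size
size_t current = 0;
CUDA_CHECK(cudaDeviceGetLimit(&current, cudaLimitMallocHeapSize));
fprintf(stdout, "\t### Device malloc heap size : %g Mo\n", current/1e6);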

You should find that doing something like this on CUDA version 4 or 5:

size_t free, total;
CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", 
                    free/1e6, total/1e6); 

cudaFree(0);   // innocuous call which forces lazy context establishment

CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", 
                    free/1e6, total/1e6); 

// Kernel loop follows

[disclaimer: written in browser, use at own risk]

should show that the available memory drops after the cudaFree(0) call, because that should initiate the context initialisation sequence which is consuming memory on your GPU.

talonmies
  • Ok thanks! I still have difficulties with the concept of cuda context, but I'm reading the documentation. – François Laenen Jul 29 '13 at 12:24
  • @FrançoisLaenen: A context is a host process/thread specific "session" on a given GPU. Each context has its own virtual address space on the GPU into which the CUDA runtime support, code and data are loaded. In the runtime API, this is mostly hidden, but if you look at the driver API documentation you can see explicit calls for managing contexts. – talonmies Jul 29 '13 at 13:48
  • Ok I get it better now. So for example, a pointer returned by cudaMalloc from host won't be usable by another context, because it will be loaded in the VAS of the first context? – François Laenen Jul 29 '13 at 15:25
  • @FrançoisLaenen: That is mostly correct, yes. The CUDA API does actually have a couple of special API features to allow pointer portability and context sharing between processes, but the "standard" model works the way you have described. – talonmies Jul 29 '13 at 15:36
  • Ok. I will post another question about heap size for which I cannot find answer. – François Laenen Jul 29 '13 at 16:00
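For completeness, the explicit context management mentioned in the comments above looks roughly like this with the driver API (a minimal sketch, error checking omitted; the runtime API performs these steps lazily behind the scenes):

#include <cuda.h>

CUdevice dev;
CUcontext ctx;

cuInit(0);                  // initialise the driver API
cuDeviceGet(&dev, 0);       // select the first GPU
cuCtxCreate(&ctx, 0, dev);  // explicit context creation; this is what the runtime does lazily

// ... allocations and kernel launches live inside this context's virtual address space ...

cuCtxDestroy(ctx);          // tears down the context and releases its GPU resources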