I was testing dynamic allocation in device code, i.e.:

```
__device__ double *temp;

__global__ void test() {
    temp = new double[125000]; // 125000 * 8 bytes = 1 MB
}
```
and calling this kernel 100 times to see whether the available memory was decreasing:
```
size_t free, total;
CUDA_CHECK(cudaMemGetInfo(&free, &total));
fprintf(stdout, "\t### Available VRAM : %g MB / %g MB (total)\n\n", free / pow(10., 6), total / pow(10., 6));
for (int t = 0; t < 100; t++) {
    test<<<1, 1>>>();
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaMemGetInfo(&free, &total)); // re-query, otherwise the loop prints stale values
    fprintf(stdout, "\t### Available VRAM : %g MB / %g MB (total)\n\n", free / pow(10., 6), total / pow(10., 6));
}
CUDA_CHECK(cudaMemGetInfo(&free, &total));
fprintf(stdout, "\t### Available VRAM : %g MB / %g MB (total)\n\n", free / pow(10., 6), total / pow(10., 6));
```
and it actually was.
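For completeness, my understanding (from the programming guide, so treat this as an assumption on my part) is that device-side `new` draws from a dedicated device heap, 8 MB by default and adjustable from the host with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)` before the first kernel launch, and that it returns `nullptr` on failure. A defensive version of the kernel (`test_checked` is just a name for this sketch) would then look like:

```
// Optional, host side, before any kernel launch: enlarge the device heap.
// CUDA_CHECK(cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024));

__global__ void test_checked() {
    temp = new double[125000]; // 1 MB taken from the device heap, not from cudaMalloc's pool
    if (temp == nullptr) {
        printf("device-side new failed: heap exhausted?\n"); // device printf
    }
}
```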
- Note: when running WITHOUT the kernel call but WITH the cudaMemGetInfo and the fprintf still inside the loop, available memory decreased from 800 MB to 650 MB, so I concluded that the console output alone accounts for ~150 MB. Indeed, with the code as written above the result doesn't change. But that's huge!
- I get a decrease in available memory of ~50 MB after the loop (fortunately, there is no decrease at all when I comment out the kernel call). When I add a `delete[] temp` inside the kernel, it hardly reduces the amount of wasted memory: I still see a decrease of ~30 MB. Why?
- Using a cudaFree(temp) or a cudaDeviceReset() after the loop doesn't help much either (see the sketch after this list for exactly what I tried). Why? And how do I properly deallocate this memory?
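For concreteness, this is roughly what those attempts look like (a sketch: `test_with_delete` and the `cudaMemcpyFromSymbol` step are my reconstruction, since the host needs a copy of the pointer before it can call cudaFree on it):

```
__global__ void test_with_delete() {
    temp = new double[125000];
    // ... work with temp ...
    delete[] temp; // matches the new[] above
}

// Host side, after the loop:
double *h_temp = nullptr;
CUDA_CHECK(cudaMemcpyFromSymbol(&h_temp, temp, sizeof(h_temp)));
cudaFree(h_temp);              // the "cudaFree(temp)" attempt; barely changes the numbers
CUDA_CHECK(cudaDeviceReset()); // doesn't seem to help either
```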