
In this case, I'm using a Jupyter notebook on a VM to train some CNN models. The VM has 16 vCPUs and 60 GB of memory, and I just attached an NVIDIA Tesla P4 for better performance. But it always gives an error like "RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 7.43 GiB total capacity; 2.20 GiB already allocated; 180.44 MiB free; 226.01 MiB cached)".

Why does this happen? The system is otherwise clean, so why do I only have this small amount of memory free?

I think the GPU is set up correctly:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    22W /  75W |      0MiB /  7611MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  • Add more description to your question. Which library are you using - TensorFlow, Keras, or something else? Share the code segment where you're specifying the GPU (if you are). In the case of TensorFlow, you can restrict GPU memory usage by passing the "per_process_gpu_memory_fraction" flag. – Rohit Lal Dec 11 '19 at 07:27
  • Have you tried decreasing the batch size, maybe to 2 or 8? Just trial and error, to see whether it's a GPU issue or a code issue. – Yash Kumar Atri Dec 11 '19 at 11:32
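If the library does turn out to be TensorFlow, the "per_process_gpu_memory_fraction" idea from the comment above might look roughly like this. This is only a sketch under that assumption (the question never confirms the framework), using the TF 1.x session API:

    # Sketch only: assumes TensorFlow 1.x, which the question does not confirm.
    import tensorflow as tf

    # Cap this process at ~50% of the P4's memory and let allocations grow
    # on demand instead of reserving the whole card up front.
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5,
                                allow_growth=True)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

    # If Keras (TF backend) is in use, hand it this session:
    # from keras import backend as K
    # K.set_session(sess)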

1 Answer


When a process allocates memory on the GPU, that memory can only be deallocated by that process, or when the process terminates. If you are seeing a CUDA out-of-memory error but nothing else appears to be running, I would suggest using a tool like nvtop to figure out what is taking up your CUDA memory. It looks like this:

[screenshot of nvtop: GPU utilization and memory graphs, with a per-process list and memory usage at the bottom]

At the bottom you can see GPU memory usage and the process command line. In the example above, the highlighted green process is taking up 84% of GPU RAM. You can use the up/down arrows to select a process and press F9 to kill it. Sometimes when I run training scripts they don't get terminated, and they show up here still occupying CUDA memory.

Note: installing nvtop is a bit involved on Ubuntu 18.04, but another tool you can use is gpustat, which only shows PIDs.
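
If installing either tool is a hassle, the same check can be scripted. This is only a rough sketch of the idea using the pynvml bindings (my own assumption; nothing above mentions pynvml): list the PIDs holding memory on GPU 0, then kill the stale one by hand.

    # Rough sketch, assuming the pynvml bindings are installed
    # (e.g. pip install nvidia-ml-py3). Lists processes holding memory on GPU 0.
    from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                        nvmlDeviceGetMemoryInfo,
                        nvmlDeviceGetComputeRunningProcesses)

    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)

    mem = nvmlDeviceGetMemoryInfo(handle)
    print("total %d MiB, used %d MiB, free %d MiB"
          % (mem.total >> 20, mem.used >> 20, mem.free >> 20))

    for proc in nvmlDeviceGetComputeRunningProcesses(handle):
        # usedGpuMemory can be None if the driver doesn't report it
        used_mib = (proc.usedGpuMemory or 0) >> 20
        print("pid %d is holding %d MiB" % (proc.pid, used_mib))

    nvmlShutdown()

Once you know the PID, kill -9 <pid> (or os.kill from Python) frees the memory, much like pressing F9 in nvtop.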

Shital Shah