
I am loading NLP models onto the GPU to run inference. But once inference is over, the GPU memory is not released:

[image: nvidia-smi output]

Running ps -a | grep python then gave me:

[image: ps -a output]

How do I solve this issue?

gawi

2 Answers

2

I'm having a similar problem: a PyTorch process on the GPU became a zombie and left its GPU memory allocated. In my case the process also showed 100% GPU utilization (GPU-Util in the nvidia-smi output). The only solution I have found so far is rebooting the system.

In case you want to try other solutions, here is what I tried before rebooting (without success):

  • Killing the parent of the zombie process: see this answer. After this, the zombie was re-parented to init (pid=1). init should reap zombie processes automatically, but this did not happen in my case (the process could still be found with ps, and the GPU memory was not freed).
  • Sending SIGCHLD to init (command: kill -17 1) to force reaping. init still did not reap the process, and the GPU memory remained in use.
  • As suggested by this answer, checking for other related child processes that might be using the GPU with fuser -v /dev/nvidia*. No other python processes were found in my case (other than the original zombie process).
  • As suggested in this issue, killing the processes that are accessing /dev/nvidia0 by running fuser -k /dev/nvidia0. This did not affect the zombie process.
  • Resetting the GPU with nvidia-smi: nvidia-smi --gpu-reset -i <device>. This failed with the error: device is currently being used by one or more other processes... Please first kill all processes using this device...
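To make the first step above easier to repeat, here is a minimal sketch that lists zombie processes together with their parent PIDs, so you know which parent to kill. It assumes a Linux system with the procps version of ps; the function names parse_zombies and list_zombies are my own and are not part of any library.

```python
import subprocess


def parse_zombies(ps_output):
    """Return (pid, ppid, command) for every process whose state starts with 'Z'.

    Expects the text produced by `ps -eo pid,ppid,stat,comm`, header included.
    """
    zombies = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)  # the command may contain spaces
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        if stat.startswith("Z"):  # 'Z' marks a defunct (zombie) process
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_zombies():
    """Run ps and return the zombies currently on the system (assumes procps ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_zombies(result.stdout)
```

Calling list_zombies() returns a list of tuples; the second element of each tuple is the parent PID. Keep in mind that, as described above, killing the parent did not free the GPU memory in my case, so treat this as a diagnostic aid rather than a fix.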

In the end, the only solution was rebooting the system.

I'm not sure what caused the error in the first place. I had a PyTorch script training on a single GPU, and I had used the same script many times without issue. I used a DataLoader with num_workers=5, which I suspect may have been the culprit, but I cannot be sure. The process suddenly hung without throwing an exception or printing anything, and left the GPU unusable.

Versions: PyTorch 1.7.1+cu110, NVIDIA driver 455.45.01, running on Ubuntu 18.04.

pdpino

I killed all Python processes (pkill python), and the zombies were gone from the GPU. I was using torch.
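If killing every Python process on the machine is too broad, a narrower variant is to signal only the Python processes that nvidia-smi reports as holding GPU memory. The following is a sketch under assumptions: it relies on nvidia-smi's --query-compute-apps CSV output, the helper names are hypothetical, and a process that is already a zombie cannot be killed this way (it may not even be listed with a usable name).

```python
import os
import signal
import subprocess


def parse_compute_apps(csv_text):
    """Parse `nvidia-smi --query-compute-apps=pid,process_name,used_memory
    --format=csv,noheader` output into (pid, process_name, used_memory) tuples."""
    apps = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        # pid is the first field, memory the last; the name is whatever is between
        apps.append((int(fields[0]), ",".join(fields[1:-1]), fields[-1]))
    return apps


def gpu_python_pids():
    """Return PIDs of GPU compute processes whose name contains 'python'."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [pid for pid, name, _ in parse_compute_apps(result.stdout) if "python" in name]


def kill_pids(pids, sig=signal.SIGTERM):
    """Send `sig` to each PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # process is already gone
```

A typical use would be kill_pids(gpu_python_pids()), escalating to signal.SIGKILL only if the processes ignore SIGTERM. You need permission to signal the target processes.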

george