26

I'm on Ubuntu 14.04, CUDA toolkit 8, driver version 367.48.

When I run nvidia-smi, it hangs indefinitely. When I log in again and try to kill that nvidia-smi process, with kill -9 <PID> for example, it isn't killed. If I run nvidia-smi again, it gets stuck as before, and from another shell I can see both processes still running.

Could this be an issue with the driver? It's not the latest version, but it's still fairly recent.

bio
  • 1
    It's not a real answer, but it's good to know that the issue disappeared after removing the 367 driver and reinstalling via the `apt` package, which ships version **361.93.02** of the Nvidia driver. – bio Jan 30 '17 at 14:19
  • This happened to me too. I wonder how a process in running state can not be killed with SIGKILL? – reith Feb 16 '18 at 17:31
  • 1
    @Reith there are some special processes the kernel cannot terminate: the init process, zombie processes, and processes in uninterruptible sleep (these wake up only when a certain I/O resource becomes available). These can only be killed by a shutdown/reboot. – Robert Hönig Mar 06 '18 at 02:06
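
To check whether a stuck process is in one of the states mentioned in the comment above, here is a minimal shell sketch, assuming a Linux /proc filesystem. The helper names `state_of_status` and `is_unkillable_state` are made up for illustration; the only real interface used is the `State:` line of `/proc/<PID>/status`.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any NVIDIA tool.

# state_of_status: read the text of /proc/<PID>/status on stdin and
# print the one-letter process state (R, S, D, Z, T, ...).
state_of_status() {
    awk '/^State:/ { print $2; exit }'
}

# is_unkillable_state LETTER: succeed for D (uninterruptible sleep) and
# Z (zombie), the states in which SIGKILL has no immediate effect.
is_unkillable_state() {
    case "$1" in
        D|Z) return 0 ;;
        *)   return 1 ;;
    esac
}

# Demo on the current shell; replace $$ with the PID of the stuck nvidia-smi.
if [ -r "/proc/$$/status" ]; then
    state=$(state_of_status < "/proc/$$/status")
    if is_unkillable_state "$state"; then
        echo "PID $$ is in state $state: kill -9 will have no immediate effect"
    else
        echo "PID $$ is in state $state: signals should be delivered normally"
    fi
fi
```

If the hung nvidia-smi reports `D`, that would be consistent with it waiting inside the driver, which is why SIGKILL appears to be ignored.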

2 Answers

24

I solved this problem by running the following at every boot:

sudo nvidia-smi -pm 1

The above command enables persistence mode. This issue has been affecting NVIDIA drivers for over two years, but NVIDIA doesn't seem interested in fixing it. It appears to be a power-management problem: some time after booting into the OS, if the nvidia-persistenced service has the no-persistence-mode option enabled, the GPU goes into a power-saving state, and nvidia-smi hangs waiting for something to give it control of the device again.
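
As a rough sketch of how the command could be wrapped for a boot script, assuming a POSIX shell: the function name `enable_persistence` and its optional argument are hypothetical, added only so the script skips cleanly on a machine where nvidia-smi is not installed. The one real call is `nvidia-smi -pm 1` from above.

```shell
#!/bin/sh
# Sketch only: enable_persistence is a hypothetical helper name.

# enable_persistence [CMD]: run "CMD -pm 1" (CMD defaults to nvidia-smi).
# Returns 1 without running anything if CMD is not installed.
enable_persistence() {
    smi=${1:-nvidia-smi}
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "enable_persistence: $smi not found, skipping" >&2
        return 1
    fi
    # Needs root, like the sudo call in the answer.
    "$smi" -pm 1
}
```

Calling `enable_persistence` from a root-owned boot script such as /etc/rc.local (an assumption about your setup) would apply the setting once per boot.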

lurscher
  • Thanks. I would accept your answer since it seems detailed enough, but I can't test it now. I guess I'll accept it anyway though :D – bio May 21 '18 at 09:20
  • 2
    Thanks. The issue still exists even with Driver Version: 410.79 and CUDA 10. I need wakeonlan to start and stop a T480 with an RTX2080 eGPU. Most of the time the nvidia-persistenced service hangs and only a physical power-off kills the service. In my nvidia-persistenced service the no-persistence-mode option is not enabled. It is really a mess with this nvidia-persistenced service; by default it doesn't work. – Yingding Wang Feb 21 '19 at 17:36
1

Given your peculiar situation, I would try reinstalling the driver, as bio proposed.

Have you tried sudo kill -9 <PID>? You probably have, but I'm putting it out there anyway. Or perhaps try sudo kill -15 <PID> (SIGTERM) to terminate it. From what you've told us, it looks as if your driver is stuck in a signal 1 (SIGHUP) hangup.

It seems odd that nvidia-smi would hang spontaneously when run, but the underlying issue may be that it was not installed correctly or is not being run with superuser access.

Have you tried to use:

service nvidia-smi status
pgrep nvidia-smi
ps -aux | grep nvidia-smi

to get its current state?

Anyway, hope this helps. I would try to uninstall and reinstall, or use sudo apt --fix-broken install to try to fix broken packages/drivers.

Cheers!