
I'm using Container-Optimized OS (COS) to run an application that takes advantage of GPUs. A separate system creates VMs to run this application on demand (to minimize cost), and I've been trying to reduce the time it takes to get my application running.

To do this, I've started using a custom VM image, which at the moment is just the COS image with my application's Docker container pre-downloaded and saved to it. I would also like to pre-install the Nvidia drivers for the GPU, but I can't get the installation to stick. Despite installing the drivers, verifying they work, and then creating the image, when I create a new VM from that image it's as if the drivers were never installed. The files all appear to be present, though. I've tried running

sudo cos-extensions install gpu

in the startup script when creating the image, but instances created from my image return an error when I try to run nvidia-smi.

The mount and nvidia-smi commands I run:

sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
/var/lib/nvidia/bin/nvidia-smi

Error:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Despite this complaint, the libnvidia-ml.so file DOES exist at: /var/lib/nvidia/lib64

The contents of my /var/lib/nvidia directory are:

$ ls -lh /var/lib/nvidia/
total 354M
-rw-r--r-- 1 root root 354M Mar 10 23:12 NVIDIA-Linux-x86_64-470.141.03_101-17162-40-42.cos
drwxr-xr-x 2 root root 4.0K Mar 10 23:12 bin
drwxr-xr-x 3 root root 4.0K Mar 10 23:12 bin-workdir
drwxr-xr-x 2 root root 4.0K Mar 10 23:12 drivers
drwxr-xr-x 3 root root 4.0K Mar 10 23:12 drivers-workdir
drwxr-xr-x 3 root root 4.0K Mar 10 23:12 firmware
drwxr-xr-x 4 root root 4.0K Mar 10 23:12 lib64
drwxr-xr-x 3 root root 4.0K Mar 10 23:12 lib64-workdir
-rw-r--r-- 1 root root 2.2K Mar 10 23:12 nvidia-installer.log
-rw-r--r-- 1 root root 1.2K Mar 10 23:12 pubkey.der
drwxr-xr-x 3 root root 4.0K Mar 10 23:12 share
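
The files being on disk does not by itself show that the `nvidia` kernel module is loaded in the new instance, which nvidia-smi also needs. A minimal diagnostic sketch that separates the two cases might look like the following. The helper names `has_nvml_library` and `nvidia_module_loaded` and the `NVIDIA_DIR` and `MODULES_FILE` variables are hypothetical, introduced only for illustration; the paths are the ones shown above plus the standard `/proc/modules` file.

```shell
#!/bin/bash
# Diagnostic sketch: are the driver files present, and is the kernel module loaded?
# NVIDIA_DIR and MODULES_FILE are hypothetical overrides, here only to make the
# checks easy to try against other locations.
NVIDIA_DIR="${NVIDIA_DIR:-/var/lib/nvidia}"
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

# True if any libnvidia-ml.so* file exists under <dir>/lib64.
has_nvml_library() {
  [ -d "$1/lib64" ] && find "$1/lib64" -name 'libnvidia-ml.so*' | grep -q .
}

# True if a module named exactly "nvidia" is listed in the given modules file.
nvidia_module_loaded() {
  [ -r "$1" ] && grep -q '^nvidia ' "$1"
}

if has_nvml_library "$NVIDIA_DIR"; then
  echo "library: found under $NVIDIA_DIR/lib64"
else
  echo "library: missing"
fi

if nvidia_module_loaded "$MODULES_FILE"; then
  echo "kernel module: loaded"
else
  echo "kernel module: not loaded"
fi
```

If this reports the library as found but the kernel module as not loaded, that would suggest the image captured the files but not the loaded-module state, which has to be re-established on each boot.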

Is there a way to create a custom image with the Nvidia drivers pre-installed that I can use?

Ethan
  • Try running `LD_PRELOAD=/usr/lib/nvidia-XXX/libnvidia-ml.so nvidia-smi`, replacing XXX with the driver version you have – Fariya Rahmat Mar 11 '23 at 04:13
  • I no longer get the missing library complaint, but I get a different error message: 'NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running'. I also had to change the path from '/usr/lib/nvidia-xxx' to '/var/lib/nvidia/lib64' since that's where container optimized OS keeps the nvidia drivers. – Ethan Mar 11 '23 at 18:12
  • Refer to this [Stack post](https://stackoverflow.com/questions/42984743/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver) to resolve the error. Let me know if this helps. – Fariya Rahmat Mar 14 '23 at 11:29
  • My question is really quite specific to the GCP COS images and the utility that comes with them. I could write my own Nvidia driver installation script to pull in the drivers and all the other required utilities, but I just want to get the COS installation tools to work properly, as I'm already quite invested in using COS images. COS images also don't have apt or apt-get installed, so I'd have to do a lot of work by hand to get it working. – Ethan Mar 20 '23 at 20:37
