I have run into an issue where running helm install on my charts works fine, but when I restart the system, the NVIDIA GPU operator fails to validate.
Bootstrapping is simple:
$ microk8s enable gpu
< watching dashboard for all the pods to turn green >
$ microk8s helm install -n morpheus morpheus-ai-engine morpheus-ai-engine
< watching for the morpheus pods to turn green >
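Rather than watching the dashboard, the readiness wait can be scripted. This is a sketch; the gpu-operator-resources namespace name is an assumption based on the default microk8s GPU addon, so check it with `microk8s kubectl get ns` first:

```shell
# Block until every GPU operator pod reports Ready (namespace name is
# an assumption -- adjust if your addon installs into a different one)
microk8s kubectl wait --for=condition=Ready pod --all \
  -n gpu-operator-resources --timeout=600s
```

The same command with `-n morpheus` works for the morpheus pods after the helm install.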
Now I can check whether the ai-engine pod has GPU access:
$ kubectl exec ai-engine-897d65cff-b2trz -- nvidia-smi
Wed Feb 22 16:35:32 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P400 Off | 00000000:04:00.0 Off | N/A |
| 0% 38C P8 N/A / 30W | 98MiB / 2048MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Running the test vector-add pod returns a Test PASSED.
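For reference, the vector-add test pod I run is along the lines of the standard CUDA sample from the Kubernetes GPU scheduling docs (the image tag here is the documented sample; yours may differ):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1   # requests exactly one GPU from the device plugin
```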
The trouble comes when I restart microk8s. The nvidia-device-plugin-validator pod fails to load with an UnexpectedAdmissionError claiming that no GPUs are available, and running nvidia-smi in the ai-engine pod returns "command not found". The vector-add test pod won't start due to insufficient GPUs.
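When the validator fails like this, it can help to check whether the node still advertises the GPU resource at all after the restart. These are generic diagnostics, not something specific to this setup:

```shell
# Does the node still report nvidia.com/gpu under Capacity/Allocatable?
microk8s kubectl describe node | grep -A 2 'nvidia.com/gpu'

# What did the scheduler/kubelet say when it rejected the validator pod?
microk8s kubectl get events -A --sort-by=.lastTimestamp | grep -i gpu
```

If the allocatable count reads 0 (or the resource is missing entirely), the device plugin never re-registered the GPU after the reboot, which would explain both the admission error and the missing nvidia-smi binary in the ai-engine container.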
But if I uninstall the ai-engine chart and restart microk8s (waiting for all the GPU operator pods to turn green), I can then reinstall ai-engine and it works fine again, as does the vector-add test.
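For completeness, the workaround scripted end to end (chart, release, and namespace names are the ones from my install above; the gpu-operator-resources namespace is an assumption):

```shell
# Work around the failed validation after a reboot:
microk8s helm uninstall -n morpheus morpheus-ai-engine
microk8s stop && microk8s start

# wait for the GPU operator pods to settle before reinstalling
microk8s kubectl wait --for=condition=Ready pod --all \
  -n gpu-operator-resources --timeout=600s

microk8s helm install -n morpheus morpheus-ai-engine morpheus-ai-engine
```

Obviously this is not sustainable; I would like the chart to survive a restart without the uninstall/reinstall cycle.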