I have run into an issue where helm-installing my charts works fine, but when I restart the system, the NVIDIA GPU operator fails to validate.

Bootstrapping is simple:

$ microk8s enable gpu

< watching dashboard for all the pods to turn green >

$ microk8s helm install -n morpheus morpheus-ai-engine morpheus-ai-engine

< watching for the morpheus pods to turn green >

Now I can check if the ai-engine pod has GPU access:

$ kubectl exec ai-engine-897d65cff-b2trz -- nvidia-smi
Wed Feb 22 16:35:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P400         Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   38C    P8    N/A /  30W |     98MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Running the test vector-add pod returns a Test PASSED.
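
For reference, the vector-add test is the standard CUDA sample pod from the GPU operator docs; mine looks roughly like this (the exact image tag may differ on your setup):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # CUDA vectorAdd sample; prints "Test PASSED" on success
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin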

The trouble comes when I restart microk8s. The nvidia-device-plugin-validator pod fails with an UnexpectedAdmissionError claiming that no GPUs are available, running nvidia-smi in the ai-engine pod now returns "command not found", and the vector-add test pod won't schedule due to insufficient GPUs.
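
For anyone trying to reproduce, these are the kinds of checks I run after the restart (the namespace here assumes the default microk8s gpu addon layout, and the validator pod name may carry a suffix on your cluster):

$ kubectl get pods -n gpu-operator-resources
$ kubectl describe pod -n gpu-operator-resources nvidia-device-plugin-validator
$ kubectl get node -o jsonpath='{.items[0].status.allocatable}'

The last command shows whether the node is still advertising an nvidia.com/gpu allocatable resource after the restart.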

But if I uninstall the ai-engine chart and restart microk8s (waiting for the gpu operator pods to all turn green), I can then reinstall ai-engine and it works fine again, as does the vector-add test.
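
In other words, the workaround sequence is roughly this (chart and release names as in my install above):

$ microk8s helm uninstall -n morpheus morpheus-ai-engine
$ microk8s stop
$ microk8s start

< waiting for the gpu operator pods to all turn green >

$ microk8s helm install -n morpheus morpheus-ai-engine morpheus-ai-engine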

1 Answer

This is an issue I am coming across too, which led me here. It looks like it was just recently fixed with this patch: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/release-notes.html#id2

The fix evicts pods requesting GPUs while the operator starts up again. This should solve your issue, as it did mine.
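
If the operator was installed via helm (the microk8s gpu addon does this under the hood), picking up the patched chart should look something like the following; the repo, release name, and namespace here are assumptions, so check helm list first:

$ microk8s helm repo update                # assumes the nvidia chart repo is already added
$ microk8s helm list -A                    # find the actual gpu-operator release name/namespace
$ microk8s helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator-resources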