GPU worker node unable to join cluster

Question

I have an EKS setup (v1.16) with two ASGs: one for compute ("c5.9xlarge") and one for GPU ("p3.2xlarge"). Both are configured as Spot with desiredCapacity set to 0.

The Kubernetes Cluster Autoscaler (CA) works as expected and scales out each ASG when necessary. The issue is that the newly created GPU instance is not recognized by the master, and running kubectl get nodes returns nothing. I can see that the EC2 instance is in the Running state, and I can SSH into the machine.

I double-checked the labels and tags and compared them to those of the compute node group. Both are configured almost identically; the only difference is that the GPU node group has a few additional tags.

Since I'm using the eksctl tool (v0.35.0) and the GPU nodeGroup is basically a copy and paste of the compute nodeGroup, I can't figure out what the problem could be.

UPDATE: After SSHing into the instance, I found the following error in /var/log/messages:

failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"

and the kubelet service crashed.

Could it be that my GPU node group uses the wrong AMI (amazon-eks-gpu-node-1.18-v20201211)?

If you can ssh into the node, then you'll want to capture the logs to see why it is not joining; no one can _guess_ what is wrong with that setup — mdaniel, Dec 30 '20 at 02:07
@mdaniel just a figure of speech. What log file should I look for? — Cowabunga, Dec 30 '20 at 06:32

score 1 · Answer 1 · answered May 30 '21 at 22:44

As a simple workaround, you can use preBootstrapCommands in the eksctl YAML config file:

- name: test-node-group
  preBootstrapCommands: 
   - "sed -i 's/cgroupDriver:.*/cgroupDriver: cgroupfs/' /etc/eksctl/kubelet.yaml"

Wael Gaith

score 0 · Answer 2 · answered Dec 30 '20 at 06:48

There is an issue with EKS 1.16; even Graviton-based machines won't join the cluster. To fix it, first try upgrading your CNI version. Please refer to the documentation here:

https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html

If that doesn't work, upgrade your EKS version to the latest available version; it should then work.

Vikrant Dubey

I'm using amazon-k8s-cni-init:v1.7.5-eksbuild.1 and amazon-k8s-cni:v1.7.5-eksbuild.1, and creating a new EKS v1.18 cluster gives the same disappointing outcome. – Cowabunga Dec 30 '20 at 13:22

score 0 · Answer 3 · answered Dec 31 '20 at 08:47

I've found the issue. It seems to be a misalignment between eksctl (v0.35.0) and the AL2 GPU AMI.

The AWS team changed the cgroup driver in Docker to "systemd" instead of "cgroupfs" (GitHub), while the version of eksctl I used hadn't absorbed the change.

A temporary workaround is to edit the /etc/eksctl/kubelet.yaml file using preBootstrapCommands, so that the kubelet's cgroupDriver matches Docker's.
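
To see what that edit does before putting it into a node group config, here is a minimal sketch that runs the same sed substitution on a throwaway file. The file contents are a hypothetical stand-in for /etc/eksctl/kubelet.yaml, not a copy of what eksctl actually writes, and the in-place flag is assumed to be GNU sed's (as on Amazon Linux 2).

```shell
#!/bin/sh
# Hypothetical sample standing in for /etc/eksctl/kubelet.yaml on the node.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
clusterDomain: cluster.local
EOF

# Same substitution as the preBootstrapCommands entry: rewrite the whole
# cgroupDriver line so the kubelet uses cgroupfs, the driver Docker reported
# in the error message.
sed -i 's/cgroupDriver:.*/cgroupDriver: cgroupfs/' "$tmp"

# Show the resulting line, then clean up.
result=$(grep '^cgroupDriver:' "$tmp")
echo "$result"
rm -f "$tmp"
```

On the node itself, docker info --format '{{.CgroupDriver}}' reports the driver Docker is using, which can be compared against the cgroupDriver value in the kubelet config.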
