
I have a Kubernetes master set up in AWS, behind an ELB. I create 5-6 instances using Terraform, provision them as kube slaves, and point the kubelets at the ELB. When I run `kubectl get nodes`, only 3 or 4 instances show up. It looks like slave registration with the master fails for a few nodes, even though all the nodes are identical.

It's random behaviour; sometimes all the slaves show up just fine.

  • Are your slaves showing up and then disappearing from `kubectl get nodes`? If not, what does `kubectl describe node` show (the slaves heartbeat with the master)? If yes, take a look at your kubelet logs (distro-specific, usually /var/log or journald). – Prashanth B May 19 '16 at 18:16
  • Have you tried something like this? `kubectl --namespace=kube-system get nodes -a` – Naveen May 19 '16 at 19:35
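
The checks suggested in the comments can be run roughly as follows (this assumes the kubelet is managed as a systemd unit named `kubelet`; on non-journald distros the logs usually live under /var/log):

```bash
# See which nodes registered with the master and what state they are in
kubectl get nodes

# Inspect a specific node's conditions and heartbeats with the master
kubectl describe node <node-name>

# On a slave that never showed up, check the kubelet logs
journalctl -u kubelet --no-pager | tail -n 100
```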

2 Answers


Answering my own question -

I name the slave nodes with their PrivateIP. I dynamically spawn slaves, attach them to the master, schedule pods, and destroy the slaves after the job is done, but I never deleted these nodes from Kubernetes, i.e. `kubectl delete node <node-name>`.

All these destroyed slave nodes were left in the 'NotReady' state, with name=PrivateIP.

Since the slaves are destroyed, their PrivateIPs are returned to the AWS IP pool, so newly spawned instances can take those IPs.

Now when I spawn new slaves and try to attach them to the master, it's possible that a few of them get the same PrivateIP as slaves that are still listed in the 'NotReady' state (since those slaves were destroyed and their IPs had already been released).

Hence Kubernetes would just flip the status of the old node entry back to 'Ready', which went unnoticed earlier since I was programmatically waiting for new slaves to show up.

Note:

'Destroy' means terminating the AWS instance.

'Delete' means detaching the slave from Kubernetes, i.e. `kubectl delete node <node-name>`.
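
For anyone automating the cleanup, a minimal sketch along these lines works (it assumes `kubectl` is already pointed at the master/ELB and that the stale entries show up as NotReady, as described above):

```bash
#!/usr/bin/env bash
# Delete stale NotReady node entries before spawning new slaves, so a reused
# PrivateIP cannot silently "revive" an old node record.
stale_nodes=$(kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}')

for node in $stale_nodes; do
  echo "Deleting stale node entry: $node"
  kubectl delete node "$node"
done
```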


From my own experience with AWS and Terraform, this might be a race condition.

ELBs usually take more time than EC2 instances to become ready, so if for any reason a kubelet starts before the ELB is able to serve traffic, the node will simply fail to register ("host not found" or "error 500", depending on the timing).

You can mitigate that in two ways:

  • make your kubelet service/container restart automatically on failure
  • create a strict dependency between the EC2 instances and the ELB, with a readiness check on the ELB (an HTTP call would suffice); a sketch follows below
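
A rough sketch of the second point, run from the instance's provisioning/user-data script before the kubelet is started (the ELB hostname and the `/healthz` path are assumptions; substitute your API server's actual endpoint):

```bash
#!/usr/bin/env bash
# Wait until the API server is reachable through the ELB before starting the
# kubelet, instead of letting early registration attempts fail.
ELB_ENDPOINT="https://my-k8s-elb.example.com/healthz"   # hypothetical address

for attempt in $(seq 1 60); do          # retry for up to ~5 minutes
  if curl --insecure --silent --fail --max-time 5 "$ELB_ENDPOINT" >/dev/null; then
    echo "API server reachable through the ELB, starting kubelet"
    break
  fi
  echo "ELB not ready yet (attempt $attempt), retrying in 5s..."
  sleep 5
done

systemctl start kubelet   # or however the kubelet is launched on your distro
```

For the first point, a `Restart=on-failure` directive in the kubelet's systemd unit (or an equivalent restart policy on the container) is usually enough.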

I would need the kubelet logs to validate that theory, of course.

Antoine Cotten