
I have a cluster of 3 nodes that I'd like to recover quickly after the loss of a single node. By recovery I mean that communication with my service resumes after a reasonable (preferably configurable) amount of time.

Here are the relevant details:

k8s version:

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T10:00:30Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T09:42:05Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

I have a service distributed over all 3 nodes. With one node failing I observe the following behavior:

  1. the API server fails over to another node and the kubernetes service endpoint shows the correct IP address (custom fail-over).
  2. the API server is not responding on 10.100.0.1 (its cluster IP)
  3. after some time, all relevant service endpoints are cleared, e.g. kubectl get ep --namespace=kube-system shows no ready addresses for any endpoint (checked as sketched after this list)
  4. the service in question is not available on its service IP (because of the above)
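
To confirm points 3 and 4, I check roughly as follows; the service name, namespace and IP/port below are placeholders, not the actual ones:

    # list endpoints; no ready addresses show up in the ENDPOINTS column
    kubectl get ep --namespace=kube-system
    kubectl get ep my-service --namespace=default -o yaml   # addresses vs. notReadyAddresses

    # the service keeps its cluster IP, but nothing answers on it
    kubectl get svc my-service --namespace=default
    curl --max-time 5 http://<service-cluster-ip>:<port>/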

The service has both readiness and liveness probes; all instances are live, but only a single instance is ready at any given time. I've checked that the instance that is supposed to be available actually is, i.e. it is both ready and live.
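
For reference, this is roughly how I verify the ready/live state; the app=my-service label is a placeholder:

    # the READY column should show 1/1 for exactly one pod
    kubectl get pods -l app=my-service -o wide
    # per-pod Ready condition, printed as "<name> Ready=<True|False>"
    kubectl get pods -l app=my-service -o jsonpath='{range .items[*]}{.metadata.name}{" Ready="}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'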

This continues for more than 15 minutes, until the service Pod that was running on the lost node is marked with a NodeLost status; at that point the endpoints are re-populated and I can access the service as usual.
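
The timeline can be watched from a separate terminal; a minimal sketch (node and service names are placeholders):

    # the lost node's Ready condition flips to Unknown once it stops reporting
    kubectl get nodes -w
    kubectl describe node <lost-node>    # the Conditions section shows the transition times

    # the pod on the lost node eventually changes status (NodeLost) and the endpoints re-appear
    kubectl get pods -o wide -w
    kubectl get ep my-service -w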

I have tried fiddling with the pod-eviction-timeout and node-monitor-grace-period settings (on the kube-controller-manager) to no avail; the time is always roughly the same.
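
For completeness, these are the knobs I understand to control detection and eviction timing in 1.5; the values below are only an example of an aggressive test configuration, not something I claim to be correct:

    # kube-controller-manager: node monitoring and pod eviction (example values)
    kube-controller-manager \
      --node-monitor-period=2s \
      --node-monitor-grace-period=16s \
      --pod-eviction-timeout=30s \
      ...
    # kubelet on every node: how often the node posts its status
    kubelet --node-status-update-frequency=4s ...

As far as I understand, --node-monitor-grace-period has to be large enough to allow several missed kubelet status updates, so it is usually kept at a multiple of --node-status-update-frequency.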

Hence, my questions:

  1. Where can I read up, in detail, on the behavior of the key k8s components in the case of a node loss?
  2. What combination of parameters would reduce the time it takes the cluster to reconcile? This is intended for use in a test.
deemok
  • Can you clarify a few things? Are your master components (kube-apiserver, kube-controller-manager) on all 3 nodes? What instructions did you use to set up the replicated apiserver and controller manager? Where is etcd? On all 3 nodes too? Which service are you talking about not responding? IIRC, the kube-apiserver is not behind a Kubernetes service in an HA configuration, but behind some other kind of load balancer. I'm not sure what 10.100.0.1 means in your setup. – Eric Tune Jun 27 '17 at 22:31
  • Yes, these services are on all 3 nodes, but only a single node runs the API server subset (apiserver/controller-manager/scheduler) at a time. The replication is custom and uses etcd-backed leader election to choose which node runs the apiserver. Etcd is also distributed over these 3 nodes. The non-responding service is a user service, but the point is that all endpoints are considered not ready and hence, I guess, the service itself is not available (i.e. no Pod can be reached using the service IP). Yes, sorry - 10.100.0.1 is the kubernetes service cluster IP (from 10.100.0.0/16). – deemok Jun 28 '17 at 21:43
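
For reference, which API server address the kubernetes service (the 10.100.0.1 cluster IP above) currently points to can be inspected through the built-in service and endpoints objects; a minimal sketch:

    kubectl get svc kubernetes --namespace=default           # cluster IP of the kubernetes service
    kubectl get ep kubernetes --namespace=default -o yaml    # the apiserver address(es) behind it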
