In my project we have an etcd database deployed on an on-prem Kubernetes cluster (this etcd is for application use, separate from the Kubernetes control-plane etcd). I deployed it with the Bitnami Helm chart as a StatefulSet. Initially we only wanted a single instance, so it was deployed with 1 replica.
The real problem started when we scaled it up to 3 replicas. I updated the configuration by adding the DNS names of the two new members to ETCD_INITIAL_CLUSTER:
etcd-0=http://etcd-0.etcd-headless.wallet.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.wallet.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.wallet.svc.cluster.local:2380
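For context, the related clustering env vars on the pods now look roughly like this (shown for etcd-0; the _TOKEN and _STATE values are my reconstruction of what the chart sets, not an exact copy):

ETCD_NAME=etcd-0
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-0.etcd-headless.wallet.svc.cluster.local:2380
ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-k8s
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER=etcd-0=...,etcd-1=...,etcd-2=...   (the full list above)

(I'm also not 100% sure whether ETCD_INITIAL_CLUSTER_STATE should be new or existing for the scaled-up members, in case that's relevant.)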
Now when I exec into any of the etcd pods and run etcdctl member list, I just get the list of members and none of them is shown as leader, which is wrong: one of the three should be the leader.
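For reference, this is how I'm checking from inside a pod (client port 2379 is the chart default; peer traffic is on 2380 as above):

etcdctl member list -w table
etcdctl endpoint status -w table --endpoints=http://etcd-0.etcd-headless.wallet.svc.cluster.local:2379,http://etcd-1.etcd-headless.wallet.svc.cluster.local:2379,http://etcd-2.etcd-headless.wallet.svc.cluster.local:2379

As far as I can tell, member list with the v3 API doesn't print leadership at all, so endpoint status (which has an IS LEADER column) is what shows me that no member is leader.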
Also, after running for some time, these pods start logging heartbeat-exceeded and server-overload warnings:
W | etcdserver: failed to send out heartbeat on time (exceeded the 950ms timeout for 593.648512ms, to a9b7b8c4e027337a)
W | etcdserver: server is likely overloaded
W | wal: sync duration of 2.575790761s, expected less than 1s
I raised the heartbeat interval from its default accordingly; the number of errors decreased, but I still get a few heartbeat-exceeded warnings along with the others.
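Concretely, the tuning change was along these lines (the values here are illustrative, not my exact ones; I followed the usual guidance that the election timeout should be roughly 10x the heartbeat interval):

ETCD_HEARTBEAT_INTERVAL=500
ETCD_ELECTION_TIMEOUT=5000

These correspond to the --heartbeat-interval and --election-timeout flags, both in milliseconds (defaults 100 and 1000).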
I'm not sure what the problem is here. Is it disk I/O that's causing it? If so, I don't know how to confirm that.
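Would something like this be the right way to confirm it? This fio job (adapted from the etcd storage guidance) mimics the WAL's fdatasync pattern; /bitnami/etcd/data is the Bitnami chart's default data dir, which is an assumption on my part:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/bitnami/etcd/data --size=22m --bs=2300 --name=etcd-io-check

My understanding is that the 99th percentile of the fdatasync latency it reports should be under 10ms for etcd to be comfortable; there's also etcdctl check perf as a built-in sanity check.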
I would really appreciate any help on this.