
I am experiencing a complicated issue with Kubernetes in my production environments: the clusters lose all their agent nodes, which change from Ready to NotReady, and all the pods change from Running to NodeLost. I have discovered that Kubernetes is making intensive use of the disks:

[Screenshot: Agent Node Usage]

[Screenshot: Agent Node Usage 2]

[Screenshot: kubectl get nodes output]

My cluster is deployed using acs-engine 0.17.0 (I tested previous versions too, and the same thing happened).

On the other hand, we deployed the Standard_DS2_VX VM series, which uses Premium disks, and increased the IOPS to 2,000 (previously under 500 IOPS), but the same thing happened. I am going to try a higher number now.

Any help on this will be appreciated.

  • We faced the same issue on ACS. Support couldn’t do much and recommended creating a new cluster! We increased I/O as well, with bigger machines, but nothing worked. – Amrit May 23 '18 at 05:57
  • Thanks for pointing this out. How did you proceed? Do you still run on Kubernetes? I could not get good answers on this from the community. We have been running for almost 3 weeks and have recreated the cluster at least 5 times, changing parts of the hardware. Nothing worked. – Hugo Marcelo Del Negro May 23 '18 at 13:05
  • I recreated the clusters with bigger VMs (more I/O, etc.), which fixed it temporarily. You can do the same and pick the max IOPS you can afford for the time being. We also have I/O metrics that push alerts to our team chat app; this helps us fix things before production goes down. We then scale up the cluster so that Kubernetes reschedules pods onto healthy nodes. We also run multiple replicas of each app spread across nodes (see the sketch after these comments) - this keeps the service stable through node failures. Not sure if it’s an Azure issue, but we are thinking of migrating to AWS. – Amrit May 23 '18 at 13:11
  • Azure has a lot of issues - maybe an internal Azure process is creating this mess, like https://github.com/Microsoft/OMS-Agent-for-Linux/issues/632. I tried to track the culprit down but failed and eventually gave up. We even “bought” a tech support subscription from Azure for this issue - but it was of no use. – Amrit May 23 '18 at 13:21
  • I am thinking of moving to AWS or GCP as well. I posted an issue on the Kubernetes GitHub showing this, to verify that it is not going to happen on another cloud provider too. If you have 5 minutes, please give me some support there (https://github.com/kubernetes/kubernetes/issues/64108) to push this along and see whether moving to another cloud provider is worthwhile... – Hugo Marcelo Del Negro May 23 '18 at 13:52
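The replica-spreading approach Amrit describes in the comments is standard Kubernetes practice. As a minimal sketch (the app name, labels, and image are hypothetical placeholders, not taken from this thread), a Deployment with several replicas and a preferred pod anti-affinity rule asks the scheduler to put replicas on different nodes, so a single NotReady node does not take the whole service down:

    # Sketch only: name, labels, and image are illustrative placeholders.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3                 # several replicas, as the comment suggests
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          affinity:
            podAntiAffinity:
              # Prefer scheduling each replica on a different node.
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: my-app
                  topologyKey: kubernetes.io/hostname
          containers:
          - name: my-app
            image: my-registry/my-app:1.0   # placeholder image

This uses a "preferred" rather than "required" rule, so pods can still be scheduled when there are more replicas than healthy nodes.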

1 Answer


It turned out to be a microservice exhausting resources, after which Kubernetes halted the nodes. We have worked on establishing resource requests/limits so we can avoid disruption of the entire cluster.
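As a minimal sketch of what establishing requests/limits can look like (the service name, image, and values here are hypothetical; tune them to your workload's measured usage), each container declares what the scheduler should reserve and the hard caps the kubelet enforces:

    # Sketch only: name, image, and values are illustrative placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: my-microservice
    spec:
      containers:
      - name: my-microservice
        image: my-registry/my-microservice:1.0   # placeholder image
        resources:
          requests:              # reserved by the scheduler for placement
            cpu: "250m"
            memory: "256Mi"
          limits:                # hard caps enforced on the container
            cpu: "500m"
            memory: "512Mi"

With limits in place, a runaway container is CPU-throttled or OOM-killed instead of starving the kubelet and flipping the node to NotReady.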