Last month we had an outage caused by the AKS scheduler going down. Commands such as kubectl were still working, but pods weren't starting. When we contacted AKS, they eventually "restarted the API server", which resolved the issue.
It worries me that we could lose something as critical as the scheduler and then have to call Azure and ask them to fix it.
Azure has made the control plane opaque from within the cluster. The API server, scheduler, and controller are not even listed as objects. We are working on a simple healthcheck pod that would start up and send a ping to Datadog saying "I'm alive". However, I tend to think that Azure should be providing some way to monitor or view the health of these services.
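For reference, here is a rough sketch of the kind of "I'm alive" ping we have in mind, not something we run in production yet. It assumes a Datadog agent is reachable from the pod over DogStatsD (UDP, port 8125 by default) and that the script runs in a freshly scheduled pod each time, for example from a CronJob, so that a successful ping implies the scheduler placed the pod. The metric name, the `cluster` tag, and the `CLUSTER_NAME` variable are made up for illustration.

```python
import os
import socket


def build_datagram(metric, value, tags):
    """Format one DogStatsD gauge as name:value|g|#key:val,key:val."""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    base = f"{metric}:{value}|g"
    return f"{base}|#{tag_str}" if tag_str else base


def send_heartbeat(host, port, metric, tags):
    """Send a single gauge of 1 over UDP (fire-and-forget, no delivery guarantee)."""
    payload = build_datagram(metric, 1, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Assumption: the agent address is passed in via environment variables.
    agent_host = os.environ.get("DD_AGENT_HOST", "127.0.0.1")
    agent_port = int(os.environ.get("DD_DOGSTATSD_PORT", "8125"))
    # Hypothetical metric name and tag, chosen only for this example.
    cluster = os.environ.get("CLUSTER_NAME", "unknown")
    send_heartbeat(agent_host, agent_port, "aks.scheduler.canary.alive", {"cluster": cluster})
```

The idea would be to pair this with a Datadog monitor that alerts when the metric stops arriving. It only tells us that a pod got scheduled and started, so it is an indirect signal and would not distinguish a scheduler failure from, say, an agent or network problem.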
Has anyone come up with a better method of monitoring these processes?