
I have a really simple Flask application running on Kubernetes (GKE). The pods get a fair amount of traffic (~60 req/s) and run under an autoscaler with a minimum of 4 pods and a maximum of 10.

Every 4-5 hours the liveness probe starts failing and all pods get restarted. I sometimes find that my pods were restarted 11-12 times during a single night. When I describe the pods I get the same error every time:

Liveness probe failed: Get http://10.12.5.23:5000/_status/healthz/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

All pods have the same number of restarts so it's not a load issue (and I also have autoscaling).

The _status/healthz/ endpoint is as simple as it gets:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/')
@app.route('/_status/healthz/')
def healthz():
    return jsonify({
        "success": True
    })

I have one other route on this application, which connects to MySQL and verifies some data. I had the same application distributed on DigitalOcean droplets running under much higher load for months without issues.

I can't seem to find out why the liveness checks start failing all at once and my pods get restarted.

The allocated resources are also decent and really close to what I had on the DigitalOcean droplets:

"resources": {
    "requests": {
        "cpu": "500m",
        "memory": "1024Mi"
    },
    "limits": {
        "cpu": "800m",
        "memory": "1024Mi"
    }
}

I have also run the same pods with a CPU limit of 100m and with 900m. Same result: every few hours all pods restart.

Liveness settings:

"livenessProbe": {
    "initialDelaySeconds": 30,
    "httpGet": {
        "path": "/_status/healthz/",
        "port": 5000
    },
    "timeoutSeconds": 5
},
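
I don't set periodSeconds or failureThreshold, so (if I'm reading the Kubernetes defaults correctly) they fall back to 10 and 3. Spelled out, the probe is effectively:

"livenessProbe": {
    "httpGet": {
        "path": "/_status/healthz/",
        "port": 5000
    },
    "initialDelaySeconds": 30,
    "timeoutSeconds": 5,
    "periodSeconds": 10,
    "failureThreshold": 3
},

That means a pod is only restarted after three consecutive probes each time out at 5 seconds, which makes the simultaneous restarts even stranger.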

UPDATE: added a readiness probe and increased CPU. Same result: 7 restarts on each of the 4 pods.
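
For reference, the readiness probe I added mirrors the liveness probe; this is roughly it (a sketch from memory, the exact delay values might differ):

"readinessProbe": {
    "httpGet": {
        "path": "/_status/healthz/",
        "port": 5000
    },
    "initialDelaySeconds": 10,
    "periodSeconds": 10,
    "timeoutSeconds": 5
}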

Romeo Mihalcea
  • It is unlikely to be a network problem, since the liveness probes are being done locally on the node. – Eric Tune Feb 16 '17 at 19:22
  • Are you using the stock VM image, or have you customized your node image? Is there some per-node process running periodically, at the same time on all nodes, which is causing the kubelet to not be able to complete the probes? Like a log rotation cron job or something? – Eric Tune Feb 16 '17 at 19:24
  • This may be related to http://stackoverflow.com/questions/42232661/occasionally-pods-will-be-created-with-no-network-which-results-in-the-pod-faili – Eric Tune Feb 16 '17 at 21:56
  • I think it may be related to https://github.com/benoitc/gunicorn/issues/1194 actually. I'll update the pods today and let you know (config sketch below the comments). – Romeo Mihalcea Feb 17 '17 at 09:27
  • have you tried to increase timeoutSeconds for readiness probe? Should help – Vit Sep 18 '18 at 13:23
  • @RomeoMihalcea Please update if you found a solution. I am facing the same issue: liveness/readiness probes failing for a Flask application on GKE in production, while everything runs fine on staging. – Harsh Manvar Feb 01 '20 at 05:13
  • I have the same issue in EKS v1.15.10-eks-bac369 – Vitaliy Aug 03 '20 at 07:58
  • What happens when you remove the probes temporarily? – Bguess Apr 15 '22 at 06:15
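
Following the gunicorn lead from the comments above, the change I'm going to test is switching gunicorn to threaded workers with an explicit worker timeout, so a slow MySQL call can't tie up the worker that serves the health endpoint. A sketch of the config (the worker/thread counts are guesses, not tuned values):

# gunicorn.conf.py
bind = "0.0.0.0:5000"
worker_class = "gthread"   # threaded workers instead of the default sync worker
workers = 2
threads = 4                # other threads can answer /_status/healthz/ while one waits on MySQL
timeout = 30               # a worker silent for more than 30s is killed and replaced (also the default)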

0 Answers