20 percent of requests timeout when a node crashing in k8s. How to solve this?

Question

I was testing my kubernetes services recently. And I found it's very unreliable. Here are the situation:
1. The test service 'A' which receives HTTP requests at port 80 has five pods deployed on three nodes.
2. An nginx ingress was set to route traffic outside onto the service 'A'.
3. The ingress was set like this:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-A
  annotations:    
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "1s"
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout invalid_header http_502 http_503 http_504"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "2"
spec:
  rules:
  - host: <test-url>
    http:
      paths:
      - path: /
        backend:
          serviceName: A
          servicePort: 80

http_load was started on an client host and kept sending request to the ingress nginx at a speed of 1000 per-seconds. All the request were routed to the service 'A' in k8s and eveything goes well.

When I restarted one of the nodes manually, things went wrong:
In the next 3 minutes, about 20% requests were timeout, which is unacceptable in product environment.

I don't know why k8s reacts so slow and is there a way to solve this problem?

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

You can speed up that fail-over process by configuring liveness and readiness probes in the pods' spec:

Container probes

...

livenessProbe: Indicates whether the Container is running. If the liveness probe fails, the kubelet kills the Container, and the Container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state is Success.

readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay is Failure. If a Container does not provide a readiness probe, the default state is Success.

Liveness probe example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

score 1 · Answer 2 · answered Mar 04 '19 at 15:45

1

Thanks for @VAS's answer!
Liveness probe is a way to solve this problem.
But I finally figured out that what I want was passive health check, which the k8s dosen't surpport.
And I solved this problem by introducing istio into my cluster.

answered Mar 04 '19 at 15:45

George Liang

131
1
10

20 percent of requests timeout when a node crashing in k8s. How to solve this?

2 Answers2

Container probes