
I've followed the http://rahmonov.me/posts/zero-downtime-deployment-with-kubernetes/ blog post and created two Docker images whose index.html returns 'Version 1 of an app' and 'Version 2 of an app'. What I want to achieve is a zero-downtime release. I'm using

kubectl apply -f mydeployment.yaml

with image: mynamespace/nodowntime-test:v1 inside, to deploy the v1 version to k8s, and then I run:

while true
do
    printf "\n---------------------------------------------\n"
    curl "http://myhosthere"
    sleep 1s
done

So far everything works. After a short time curl returns 'Version 1 of an app'. Then I apply the same k8s deployment file with image: mynamespace/nodowntime-test:v2. And well, it works, but there is one (always exactly one) Gateway Timeout response between v1 and v2, so it's not really a no-downtime release ;) It is much better than without RollingUpdate, but not perfect.

I'm using the RollingUpdate strategy and a readinessProbe:

---                              
apiVersion: apps/v1              
kind: Deployment                 
metadata:                        
  name: nodowntime-deployment    
spec:                            
  replicas: 1                    
  strategy:                      
    type: RollingUpdate          
    rollingUpdate:               
      maxUnavailable: 0          
      maxSurge: 1                
  selector:                      
    matchLabels:                 
      app: nodowntime-test       
  template:                      
    metadata:                    
      labels:                    
        app: nodowntime-test     
    spec:                        
      containers:                
      ...                        
        readinessProbe:          
          httpGet:               
            path: /              
            port: 80             
          initialDelaySeconds: 5 
          periodSeconds: 5       
          successThreshold: 5 

Can I do it better? Is it an issue with syncing all of that with the ingress controller? I know I can tweak it by using minReadySeconds so the old and new pods overlap for some time (see the sketch below), but is that the only solution?
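
For reference, such a minReadySeconds tweak sits on the Deployment spec; a minimal sketch, where the 10-second value is only an illustrative assumption:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodowntime-deployment
spec:
  replicas: 1
  # Illustrative value: a new pod must stay Ready for 10s before it counts
  # as available; with maxUnavailable: 0 the old pod is kept until then.
  minReadySeconds: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  ...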

hi_my_name_is

1 Answer


I've recreated the mentioned experiment and raised the request rate to roughly 30 per second by starting three simultaneous processes of the following loop:

while true
do
    curl -s https://<NodeIP>:<NodePort>/ -m 0.1 --connect-timeout 0.1 | grep Version || echo "fail"
done

After editing the deployment and changing the image version several times, there were no lost requests at all during the transition. I even caught a short moment when both images were serving requests at the same time:

  Version 1 of my awesome app! Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!

Therefore, if you send requests to the service directly, it works as expected.

The "Gateway Timeout" error is a reply from the Traefik proxy. Traefik opens a TCP connection to the backend through a set of iptables rules.
During a RollingUpdate the iptables rules change, but Traefik doesn't know that, so the connection is still considered open from Traefik's point of view. After the first unsuccessful attempt to go through the now-nonexistent iptables rule, Traefik reports "Gateway Timeout" and closes the TCP connection. On the next try, it opens a new connection to the backend through the new iptables rule, and everything works again.

It could be fixed by enabling retries in Traefik.

# Enable retry sending request if network error
[retry]

# Number of attempts
#
# Optional
# Default: (number servers in backend) -1
#
# attempts = 3

Update:

We finally worked around it without using Traefik's 'retry' feature, which could potentially require idempotent request handling in all services (good to have anyway, but we could not afford forcing all projects to do that). What you need is the Kubernetes RollingUpdate strategy plus a configured readinessProbe and a graceful shutdown implemented in your app.
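
As an illustration only (none of this is from the post above), "graceful shutdown" usually means the app handles SIGTERM by draining in-flight requests, combined with a small preStop delay so the ingress/iptables stop routing new traffic to the old pod first. A minimal pod-spec sketch with hypothetical values:

spec:
  # Give in-flight requests time to finish before the pod is killed.
  terminationGracePeriodSeconds: 30
  containers:
  - name: nodowntime-test
    image: mynamespace/nodowntime-test:v2
    lifecycle:
      preStop:
        exec:
          # Hypothetical delay: endpoints/iptables get updated while the old
          # container keeps serving, then the app receives SIGTERM and drains.
          command: ["sh", "-c", "sleep 5"]
    readinessProbe:
      httpGet:
        path: /
        port: 80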

hi_my_name_is
VAS
  • Thanks for pointing out Traefik and iptables - will check it and get back with feedback. This "Gateway Timeout" is not a big deal for me, but it would be great to avoid. – hi_my_name_is Aug 10 '18 at 05:18
  • Thanks, it works. We also tweaked the https://docs.traefik.io/configuration/commons/#forwarding-timeouts defaultTimeout to 5s so the client gets a response faster. – hi_my_name_is Aug 10 '18 at 07:05
  • Hey, `<NodeIP>:<NodePort>/` implies a service type of `NodePort`, but the traefik ingress, as far as I can tell, is supposed to use a service type of `ClusterIP` (since you can't have the same port on different apps on the same node otherwise). Did you just set up a separate, normally unused service for testing? Or is this `NodePort` actually the `traefik` service, not the application service? – Andrew Savinykh Mar 18 '19 at 22:45
  • The first part of the answer is about a service object of type: NodePort between the application pods and the traefik ingress. NodePort adds a set of iptables rules, in addition to the ClusterIP rules, which forward requests that arrive at the node interface. The ClusterIP set of rules is still present and works exactly the same way as if the service had type: ClusterIP. In the case of type: NodePort you can send requests to the cluster IP and the node port at the same time. I hope I answered your question. – VAS Mar 19 '19 at 07:50