
I have a Deployment with 2 replicas of an nginx + openconnect VPN proxy container (each pod has only one container).

They start without any problems and everything works, but once the connection crashes and my liveness probe fails, the nginx container is restarted and ends up in CrashLoopBackOff, because the openconnect and nginx restarts fail with:

nginx:

host not found in upstream "example.server.org" in /etc/nginx/nginx.conf:11

openconnect:

getaddrinfo failed for host 'vpn.server.com': Temporary failure in name resolution

It seems like /etc/resolv.conf is edited by openconnect and stays modified across the restart (although it is not part of a persistent volume). I believe the whole container should be run from a clean Docker image, where /etc/resolv.conf is not modified, right?

The only way to fix the CrashLoopBackOff is to delete the pod, after which the deployment's ReplicaSet runs a new pod that works.

How is creating a new pod different from the container in the pod being restarted by the liveness probe (restartPolicy: Always)? Is the container restarted from a clean image?

bartimar
  • Sorry, missing a point here. Do you provide a custom resolv.conf file in your containers? – whites11 Dec 26 '17 at 14:00
  • @whites11 No, I don't. But openconnect modifies it during runtime. When the container is restarted afterwards, it seems like the resolv.conf stays modified and dns resolver does not work... – bartimar Dec 26 '17 at 18:16

1 Answer


restartPolicy applies to all Containers in the Pod, not the pod itself. Pods usually only get re-created when someone explicitly deletes them.
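
One way to see that distinction from the outside is below; the pod name is made up for illustration, not taken from the question:

    # A container restart shows up in the RESTARTS column of the same pod;
    # only deleting the pod makes the ReplicaSet create a brand-new one.
    kubectl get pods
    # NAME                            READY   STATUS             RESTARTS   AGE
    # example-nginx-5d4f8c7b6-abcde   0/1     CrashLoopBackOff   5          2d

    kubectl delete pod example-nginx-5d4f8c7b6-abcde
    kubectl get pods
    # NAME                            READY   STATUS    RESTARTS   AGE
    # example-nginx-5d4f8c7b6-xyz12   1/1     Running   0          1m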

I think this explains why the restarted container with the bad resolv.conf fails but a new pod works.

A "restarted container" is just that, it is not spawned new from the downloaded docker image. It is like killing a process and starting it - the file system for the new process is the same one the old process was updating. But a new pod will create a new container with a local file system view identical to the one packaged in the downloaded docker image - fresh start.

navicore
  • The pod has only one container. How does it explain why the restarted container fails and new pod works? Shouldn't the restarted container run a clean image? – bartimar Dec 26 '17 at 18:17
  • I edited the question for less confusion about the container/pod terminology. – bartimar Dec 26 '17 at 18:19
  • A "restarted container" is just that, it is not spawned new from the downloaded docker image. It is like killing a process and starting it - the file system for the new process is the same one the old process was updating. But a new pod will create a new container with a local file system view identical to the one packaged in the downloaded docker image - fresh start. – navicore Dec 26 '17 at 18:24
  • So it is a parallel to docker stop + docker start? Is the entrypoint being executed? Or just CMD? Any docs on the implementation? :( – bartimar Dec 26 '17 at 18:28
  • I see there is even docker restart command https://docs.docker.com/engine/reference/commandline/restart/ but what the hell does that do exactly... gotta love these docs... – bartimar Dec 26 '17 at 18:29
  • ha. agree, muddy fine distinctions with these terms. but "yes", like "docker stop + start" for a container restart and more like "docker stop + rm + start" for deleting and creating a new pod. "entrypoint/CMD" doesn't change. – navicore Dec 26 '17 at 18:35
  • but when I do docker restart, is the entrypoint being executed again as the container starts? – bartimar Dec 26 '17 at 18:38
  • yes - there is no "snapshot" state of a running process as with some VMs, your only state when you restart is the container's files. Your entrypoint and cmd are rerun. I'm guessing you are correct that your vpn client has modified resolv.conf in a way that stops it from re-connecting out of the cluster after the restart. Maybe try to kubectl exec into the container and show us what the resolv.conf looks like after the client starts? – navicore Dec 26 '17 at 18:44
  • a hack might be to wrap your entrypoint/cmd startup in a script that saves resolv.conf and restores it if an old good one is present (see the sketch after this thread). – navicore Dec 26 '17 at 18:46
  • yep, that is exactly what I did this morning after it failed (it fails about twice a month, so I deleted the pod fast, checked the logs and went on analyzing the problem, but I didn't have the failed container available anymore): I create a backup of resolv.conf in the Dockerfile and replace it in the entrypoint before calling openconnect. Hopefully it will work next time. Thanks – bartimar Dec 26 '17 at 18:50
  • What I needed to know was only how the container restart is implemented. Edit your answer and I will close this as answered. Thanks again:) – bartimar Dec 26 '17 at 18:51
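
For reference, a minimal sketch of the save-and-restore entrypoint wrapper idea from the comments above; the backup path and the openconnect and nginx invocations are placeholders, not taken from the question:

    #!/bin/sh
    # entrypoint.sh - sketch of the save/restore idea discussed in the comments.
    # The backup could also be baked into the image, e.g. with
    # "COPY resolv.conf.good /etc/resolv.conf.bak" in the Dockerfile.
    set -e

    if [ -f /etc/resolv.conf.bak ]; then
        # A saved copy exists: restore it before openconnect has to
        # resolve the VPN host again.
        cat /etc/resolv.conf.bak > /etc/resolv.conf
    else
        # No backup yet: keep a copy of the current, unmodified resolv.conf.
        cp /etc/resolv.conf /etc/resolv.conf.bak
    fi

    # Placeholder startup: bring the VPN up, then run nginx in the foreground.
    openconnect --background vpn.server.com
    exec nginx -g 'daemon off;'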