Dear fellow Server Faulters!
I have stumbled upon a problem that confuses me A LOT, and you are my only help! Long story short:

1. I have a set of Ansible playbooks to configure servers.
2. These playbooks are tested and verified against Google Cloud and DigitalOcean.
3. This week, the playbook that starts Docker containers on servers began to fail on the newest Google Compute instances (eu-west4-a), and only there: older Compute instances behave as expected, as do instances at other providers.
Ansible fails with the following error:
fatal: [example_hostname]: FAILED! => {"changed": false,
"msg": "Error starting container b57a81e7e1a39da89f91c6d6439b51cf4078f87f5a6997a7dcdd7098f84a7485:
500 Server Error: Internal Server Error (\"OCI runtime create failed:
container_linux.go:348: starting container process caused \"process_linux.go:402:
container init caused \\\"invalid argument\\\"\": unknown\")"}
When the container is started with docker run (using the same parameters as in the Ansible playbook), it starts successfully. Nor is it limited to one particular image: no containers whatsoever can be started via Ansible, while starting them with docker run directly on the server works fine.
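For reference, the failing tasks are ordinary `docker_container` tasks; a minimal sketch of their shape (the image and names here are placeholders, not my actual playbook) would be:

```yaml
# Hypothetical example of the task shape that fails; any image behaves the same.
- name: Start node_exporter container
  docker_container:
    name: node_exporter
    image: prom/node-exporter
    state: started
    restart_policy: always
```

Any task of this shape fails with the error above, regardless of the image used.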
The Docker daemon fails with the following errors:
time="2018-07-25T11:54:10.809711044+02:00" level=debug msg="Calling GET /_ping"
time="2018-07-25T11:54:10.810917212+02:00" level=debug msg="Calling POST /v1.38/containers/node_exporter/restart"
time="2018-07-25T11:54:10.813468661+02:00" level=debug msg="container mounted via layerStore: &{/var/lib/docker/overlay2/a11e836fe1015928ca1181298f0934a8ee70c122e4eb1e09635b2d31c11eb3b1/merged 0x55808cfe7620 0x55808cfe7620}"
time="2018-07-25T11:54:10.813897061+02:00" level=debug msg="container mounted via layerStore: &{/var/lib/docker/overlay2/a11e836fe1015928ca1181298f0934a8ee70c122e4eb1e09635b2d31c11eb3b1/merged 0x55808cfe7620 0x55808cfe7620}"
time="2018-07-25T11:54:10.820720679+02:00" level=debug msg="EnableService a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 START"
time="2018-07-25T11:54:10.821163095+02:00" level=debug msg="EnableService a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 DONE"
time="2018-07-25T11:54:10.826647553+02:00" level=debug msg="bundle dir created" bundle=/var/run/docker/containerd/a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 module=libcontainerd namespace=moby root=/var/lib/docker/overlay2/a11e836fe1015928ca1181298f0934a8ee70c122e4eb1e09635b2d31c11eb3b1/merged
time="2018-07-25T11:54:10+02:00" level=debug msg="event published" ns=moby topic="/containers/create" type=containerd.events.ContainerCreate
time="2018-07-25T11:54:10+02:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0/shim.sock" debug=true pid=7576
time="2018-07-25T11:54:10+02:00" level=debug msg="registering ttrpc server"
time="2018-07-25T11:54:10+02:00" level=debug msg="serving api on unix socket" socket="[inherited from parent]"
time="2018-07-25T11:54:10+02:00" level=info msg="shim reaped" id=a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0
time="2018-07-25T11:54:10.870453438+02:00" level=error msg="stream copy error: reading from a closed fifo"
time="2018-07-25T11:54:10.870900093+02:00" level=error msg="stream copy error: reading from a closed fifo"
time="2018-07-25T11:54:10+02:00" level=debug msg="event published" ns=moby topic="/containers/delete" type=containerd.events.ContainerDelete
time="2018-07-25T11:54:10.883501549+02:00" level=error msg="a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 cleanup: failed to delete container from containerd: no such container"
time="2018-07-25T11:54:10.885529863+02:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: Cannot restart container node_exporter: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"invalid argument\\\"\": unknown" error_type="*errors.errorString" module=api
time="2018-07-25T11:54:10.885962960+02:00" level=error msg="Handler for POST /v1.38/containers/node_exporter/restart returned error: Cannot restart container node_exporter: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"invalid argument\\\"\": unknown"
time="2018-07-25T11:54:10.886356247+02:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: Cannot restart container node_exporter: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"invalid argument\\\"\": unknown" error_type="*errors.errorString" module=api
None of these errors appear when running via docker run. I originally thought this was an issue with Ansible or Python, but installing the same set of packages and libraries in exactly the same versions made me look elsewhere. Ultimately, the only thing that lets me start Docker containers via Ansible again is issuing:
apt purge -y google-compute-engine google-compute-engine-oslogin
Once these packages are gone and the instance is rebooted, everything works flawlessly! Comparing against other GCE servers, the current versions of these packages are:
ii google-compute-engine 2.8.3-1 all Google Compute Engine guest environment.
ii google-compute-engine-oslogin 1.3.0-1+deb9 amd64 Google Compute Engine OS Login
From other instance where everything works as expected:
ii google-compute-engine 2.7.6-1 all Google Compute Engine guest environment.
ii google-compute-engine-oslogin 1.1.5-1+deb9 amd64 Google Compute Engine OS Login
I can't downgrade these packages since they apparently are no longer present in the repository.
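This is how I checked for older versions (only the newest candidate is listed; older versions are gone from the repo):

```shell
# Show which versions the configured repositories still offer
apt-cache policy google-compute-engine google-compute-engine-oslogin
```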
Now regarding OS Login: I am not using this feature, and explicitly disabling it through instance metadata (enable-oslogin = FALSE) does not change anything. The only reliable "fix" for this problem is to remove the packages, but since this is a sudden breakage, that does not feel like the proper approach.
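For the record, this is how I am setting the metadata (the instance name is a placeholder):

```shell
# Disable OS Login on a single instance via its metadata
gcloud compute instances add-metadata example-instance \
    --metadata enable-oslogin=FALSE
```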
Likewise, executing
/usr/bin/google_oslogin_control deactivate
also does not fix anything, even after a reboot of the whole VM (after removing the packages, a reboot is needed as well, though an sshd restart would probably be sufficient). I have run out of ideas about what to check and where to look. Removing the packages does not feel like the right way to go, since they are present on other VMs; this looks like a recent change of behaviour somewhere, but I cannot pinpoint where.
Any kind of reply is welcome and appreciated!