Docker + Ansible + google-compute-engine debian packages - a confusing story

Question

Dear fellow Server Faulters!

I have stumbled upon a problem that confuses me A LOT and you are my only help! Long story short: 1. I have a set of ansible playbooks to configure servers 2. These playbooks are tested and verified against Google Cloud and DigitalOcean 3. This week, playbook starting docker containers on servers started to fail on newest Google Compute instances (eu-west4-a) and only there - old compute instances behave as expected, as well as instances in other ISPs

Ansible fails with following error:

fatal: [example_hostname]: FAILED! => {"changed": false, 
"msg": "Error starting container b57a81e7e1a39da89f91c6d6439b51cf4078f87f5a6997a7dcdd7098f84a7485: 
500 Server Error: Internal Server Error (\"OCI runtime create failed: 
container_linux.go:348: starting container process caused \"process_linux.go:402: 
container init caused \\\"invalid argument\\\"\": unknown\")"}

When the container is started by docker run (with the same parameters as in ansible playbook) container starts successfully. It's also not the case with just one of images - no containers whatsoever can be started via ansible. At the same time, starting them by docker run directly on the server works well.

Docker daemon fails with following error:

time="2018-07-25T11:54:10.809711044+02:00" level=debug msg="Calling GET /_ping"
time="2018-07-25T11:54:10.810917212+02:00" level=debug msg="Calling POST /v1.38/containers/node_exporter/restart"
time="2018-07-25T11:54:10.813468661+02:00" level=debug msg="container mounted via layerStore: &{/var/lib/docker/overlay2/a11e836fe1015928ca1181298f0934a8ee70c122e4eb1e09635b2d31c11eb3b1/merged 0x55808cfe7620 0x55808cfe7620}"
time="2018-07-25T11:54:10.813897061+02:00" level=debug msg="container mounted via layerStore: &{/var/lib/docker/overlay2/a11e836fe1015928ca1181298f0934a8ee70c122e4eb1e09635b2d31c11eb3b1/merged 0x55808cfe7620 0x55808cfe7620}"
time="2018-07-25T11:54:10.820720679+02:00" level=debug msg="EnableService a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 START"
time="2018-07-25T11:54:10.821163095+02:00" level=debug msg="EnableService a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 DONE"
time="2018-07-25T11:54:10.826647553+02:00" level=debug msg="bundle dir created" bundle=/var/run/docker/containerd/a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 module=libcontainerd namespace=moby root=/var/lib/docker/overlay2/a11e836fe1015928ca1181298f0934a8ee70c122e4eb1e09635b2d31c11eb3b1/merged
time="2018-07-25T11:54:10+02:00" level=debug msg="event published" ns=moby topic="/containers/create" type=containerd.events.ContainerCreate
time="2018-07-25T11:54:10+02:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0/shim.sock" debug=true pid=7576
time="2018-07-25T11:54:10+02:00" level=debug msg="registering ttrpc server"
time="2018-07-25T11:54:10+02:00" level=debug msg="serving api on unix socket" socket="[inherited from parent]"
time="2018-07-25T11:54:10+02:00" level=info msg="shim reaped" id=a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0
time="2018-07-25T11:54:10.870453438+02:00" level=error msg="stream copy error: reading from a closed fifo"
time="2018-07-25T11:54:10.870900093+02:00" level=error msg="stream copy error: reading from a closed fifo"
time="2018-07-25T11:54:10+02:00" level=debug msg="event published" ns=moby topic="/containers/delete" type=containerd.events.ContainerDelete
time="2018-07-25T11:54:10.883501549+02:00" level=error msg="a31ae80856cef0bc72298b68660739e04bc710cd815ce2a746d7276065464ab0 cleanup: failed to delete container from containerd: no such container"
time="2018-07-25T11:54:10.885529863+02:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: Cannot restart container node_exporter: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"invalid argument\\\"\": unknown" error_type="*errors.errorString" module=api
time="2018-07-25T11:54:10.885962960+02:00" level=error msg="Handler for POST /v1.38/containers/node_exporter/restart returned error: Cannot restart container node_exporter: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"invalid argument\\\"\": unknown"
time="2018-07-25T11:54:10.886356247+02:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: Cannot restart container node_exporter: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"invalid argument\\\"\": unknown" error_type="*errors.errorString" module=api

None of these when running via docker run. Originally thought to be an issue related to ansible or python, but having installed the same set of packages and libraries in exactly the same versions made me look forward. So ultimately, what allows me to be starting docker containers via ansible again is issuing...

apt purge -y google-compute-engine google-compute-engine-oslogin

Once these packages are gone and instance is rebooted, everything works flawlessly!!! Verifying with other servers from GCE, current versions of these packages are:

ii  google-compute-engine                 2.8.3-1                        all          Google Compute Engine guest environment.
ii  google-compute-engine-oslogin         1.3.0-1+deb9                   amd64        Google Compute Engine OS Login

From other instance where everything works as expected:

ii  google-compute-engine                 2.7.6-1                        all          Google Compute Engine guest environment.
ii  google-compute-engine-oslogin         1.1.5-1+deb9                   amd64        Google Compute Engine OS Login

I can't downgrade these packages since they apparently are not present in repo anymore.

Now regarding the OSLogin - I am not using this feature, and explicitly blocking it through instance metadata (enable-oslogin = FALSE) does not change anything. The only reliable "fix" to this problem is to remove packages - but since it's "sudden" break, it does not feel like proper approach.

As well as after removing these packages (& rebooting - SSHd restart would probably be sufficient). Executing

/usr/bin/google_oslogin_control deactivate

Also does not fix anything (even after reboot of whole VM). I have ran out of ideas what to check and what to look at - I feel that removing the packages is not really the way to go, since they are present on other VMs and it does seem like a recent change of behaviour somewhere I can not pinpoint where.

Any kind of reply is welcome and appreciated!

With respect to "...This week, playbook starting docker containers on servers started to fail on newest Google Compute instances ..." you might want to open and issue with GCE. Probably "Create new Container Registry issue"? https://issuetracker.google.com/issues/new?component=187156&template=1009615 — Vladimir Botka, Jul 28 '18 at 06:08
I've opened the issue https://issuetracker.google.com/issues/111907041 — Vladimir Botka, Jul 28 '18 at 06:12
Hey @VladimirBotka - thank you for your comment and opening the issue. I'm sorry if the question did not make it clear, but I am not using Container Registry (or any container functionality from Google) - it's simple Compute Instance with docker installed on it. Also I did not bother opening issue with Google for I simply have really poor experience with their support and I went straight here (which is suggested by them anyway) — paranoid, Jul 30 '18 at 09:27
@paranoid I am looking into the report at [issuetracker](https://issuetracker.google.com/issues/111907041). It would be nice to have simple detailed reproduction steps to investigate the guest environment behavior mentioned here. And is this only happening in europe-west4-a ? Thank you. — Fady, Jul 30 '18 at 20:30

score 0 · Accepted Answer · answered Aug 24 '18 at 07:58

As of per Google's issue tracker: https://issuetracker.google.com/issues/111907041

The cause of this problem was long hostname (50 characters in my case) - reducing it has solved the problem and allowed containers to start. I invite everyone interested to look into Google issue for more details.

Docker + Ansible + google-compute-engine debian packages - a confusing story

1 Answers1