I want to implement a shutdown-script
that would be called when my VM is going to be preempted on Google Compute Engine. That VM is used to run dockers containers that execute long running batches, so I send them a signal to make them gracefully exit.
That shutting-down script is working well when I execute it manually, yet it breaks on a real premption use-case, or when I kill the VM by myself.
I got this error:
... logs from my containers ...
A 2019-08-13T16:54:07.943153098Z time="2019-08-13T16:54:07Z" level=error msg="error waiting for container: unexpected EOF"
(just after this error, I can see what I put in the 1st line of my shutting-down script, see code below)
A 2019-08-13T16:54:08.093815210Z 2019-08-13 16:54:08: Shutting down! TEST SIGTERM SHUTTING DOWN (this is the 1st line of my shuttig-down script)
A 2019-08-13T16:54:08.093845375Z docker ps -a
(no reult)
A 2019-08-13T16:54:08.155512145Z ps -ef
... a lot of things, but nothing related to docker ...
2019-08-13 16:54:08: task_subscriber not running, shutting down immediately.
I use preemptible VM from GCE, with image Container-Optimized OS 73-11647.267.0 stable
. I run my dockers as service with systemctl
, yet I don't thik this is related - [edit] Actually I could solve my issue thanks to this.
Right now, I am pretty sure that a lot of things happens when Google send the ACPI signal to my VM, even before the shutdown-script is fetched from the VM metadata and is called.
My guess is that all the services are being stopped at the same time, eventually docker.service
itself.
When my container is running, I can get the same level=error msg="error waiting for container: unexpected EOF"
with a simple sudo systemctl stop docker.service
Here is a part of my shuting-down
script:
#!/bin/bash
# This script must be added in the VM metadata as "shutdown-script" so that
# it is executed when the instance is being preempted.
CONTAINER_NAME="task_subscriber" # For example, "task_subscriber"
logTime() {
local datetime="$(date +"%Y-%m-%d %T")"
echo -e "$datetime: $1" # Console
echo -e "$datetime: $1" >>/var/log/containers/$CONTAINER_NAME.log
}
logTime "Shutting down! TEST SIGTERM SHUTTING DOWN"
echo "docker ps -a" >>/var/log/containers/$CONTAINER_NAME.log
docker ps -a >>/var/log/containers/$CONTAINER_NAME.log
echo "ps -ef" >>/var/log/containers/$CONTAINER_NAME.log
ps -ef >>/var/log/containers/$CONTAINER_NAME.log
if [[ ! "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; then
logTime "${CONTAINER_NAME} not running, shutting down immediately."
sleep 10 # Give time to send logs
exit 0
fi
logTime "Sending SIGTERM to ${CONTAINER_NAME}"
#docker kill --signal=SIGTERM ${CONTAINER_NAME}
systemctl stop taskexecutor.service
# Portable waitpid equivalent
while [[ "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; do
sleep 1
logTime "Waiting for ${CONTAINER_NAME} termination"
done
logTime "${CONTAINER_NAME} is done, shutting down."
logTime "TEST SIGTERM SHUTTING DOWN BYE BYE"
sleep 10 # Give time to send logs
If I simply call systemctl stop taskexecutor.service
manually (not by really shutting down the server), the SIGTERM signal is sent to my docker and my app properly handle it and exists.
Any idea?
-- How I solved my issue --
I could solve it by adding this dependency on docker in my service config:
[Unit]
Wants=gcr-online.target docker.service
After=gcr-online.target docker.service
I don't know how the magic works beyond the execution of the shutdown-script
stored in metadata by Google. But I think that they should fix something in their Container-Optimized OS
so that that magic happens before docker is stopped. Otherwise, we could not rely on it to gracefully shutdown a basic script with it (hopefully I was using systemd here)...