
I am using the cr-defunct checkpoint/restore code (based on feedback from Ross Boucher) to build Docker 1.10.0-dev from source, in order to get checkpoint/restore functionality.

When I checkpoint a container that has no active TCP connections and then restore it into a newly created container, I have no problems. However, if there is an active TCP connection, the restore fails. The failure may have some other cause, I am not sure, but the TCP error is what stands out in restore.log. Here is how I reproduce it.

Start a Docker container (I use alpine-sshd as the base image):

docker run -d --security-opt seccomp:unconfined --name a1 alpine-sshd

Then I ssh into the container (I have already set up the user):

ssh abc@172.17.0.2

So now there is an active TCP connection on port 22 for that container, which I can verify by entering the container and running "netstat -na".
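As a rough sketch, the same check can be run from the host by piping netstat output through a small filter. `established_on_port` is a hypothetical helper name (not part of Docker or netstat), and it assumes netstat prints the usual Proto / Recv-Q / Send-Q / Local Address / Foreign Address / State columns.

```shell
# Hypothetical helper: keep only ESTABLISHED TCP lines whose local address
# ends in ":<port>". Reads netstat-style output on stdin.
established_on_port() {
    port="$1"
    awk -v p=":$port" '
        $1 ~ /^tcp/ && $NF == "ESTABLISHED" &&
        substr($4, length($4) - length(p) + 1) == p
    '
}

# Usage (assumes container a1 is running and its image ships netstat):
#   docker exec a1 netstat -na | established_on_port 22
```

If this prints nothing, only the listening socket is present and the checkpoint should not contain an established connection.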

Now I create a new container (without starting it) using:

docker create --security-opt seccomp:unconfined --name=a3 alpine-sshd

"docker ps -a" reveals two containers, a1 and a3

Next, I checkpoint the a1 container. The --leave-running flag makes no difference here, since it is not used during the restore, which is where the actual error occurs.

docker checkpoint --image-dir=/tmp/ABC a1

Then I restore using /tmp/ABC

docker restore --force=true --image-dir=/tmp/ABC a3

This causes the following error

Error response from daemon: Cannot restore container a3: cantstart: Cannot start container c40adc.....<snip ID>...: criu failed: type NOTIFY error 0
log file: /var/lib/docker/0.0/containers/c40adc...<snip ID>../criu.work/restore.log

The restore.log has the following notable errors:

14: Restoring TCP connection
14: Restoring TCP connection id 13 ino 153c9
14:      Setting 1 queue seq to 2533629009
14:      Setting 2 queue seq to 1507997351
14: Error (sk-inet.c:721): Can't bind inet socket (id 19): Cannot assign requested address
10: Error (cr-restore.c:1350): 14 exited, status=1

At the bottom of the log file

10: Restored
Error (cr-restore.c:1352): 20710 killed by signal 9
Error (cr-restore.c:2182): Restore failed
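For anyone reproducing this, a minimal way to pull just the error lines out of a long restore.log is sketched below. `criu_errors` is a hypothetical helper name, and it assumes CRIU marks errors with the literal text "Error (", as in the excerpts above.

```shell
# Hypothetical helper: print only the error lines from a CRIU log file.
# $1 = path to the log (for example the restore.log named in the daemon error)
criu_errors() {
    grep -F 'Error (' "$1"
}

# Usage:
#   criu_errors /path/to/criu.work/restore.log
```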

I don't necessarily need the networking to be restored, although it would be useful to have. Right now, I just want a stable restore from a checkpoint image that was taken while network connections were active.

NOTE that if I do this entire sequence without the ssh/TCP connection, it works nicely.

Any help will be greatly appreciated. I can provide the full restore.log and other files if needed. Thanks in advance.

userVK
  • As a somewhat hackier answer, since you explicitly state you don't necessarily need the connection to be restored: you can build a custom version of criu with an early return before the call to bind() in inet_bind() (sk-inet.c). – Ross Boucher Jun 13 '16 at 21:34
  • Just FYI - so this is not the 'traditional' C/R for which there are a decent number of articles online. My aim is actually to proxy TCP connections into a container `a1`, and then, on-demand, move some of the connections to be proxied to a different container `a3`. So, a subset of connections still terminate on `a1` and the intention is to keep it running. Clearly, the ipaddress of the new container `a3` has to be different. So long as the application state is checkpointed from `a1` and moved to `a3`, I should be able to get something viable going for what I want to achieve. – userVK Jun 14 '16 at 03:03

1 Answer


The most likely explanation, I think, is that both containers need to have the same IP address for TCP connection restore to work. Unfortunately, that's not easily achievable with Docker 1.10.

One thing you could try is building the new 1.12-based version of checkpoint/restore, available in the "docker-checkpoint-restore" branch of my GitHub repo (I'll try to make a pre-compiled release soon). Docker 1.12 lets you request an IP when creating a container. The checkpoint/restore API has changed slightly:

# to create
$ docker checkpoint create <container_id> <checkpoint_name>

# to restore
$ docker start --checkpoint <checkpoint_name> <container_id>

Note that if you want to create a new container in this new system, currently you'll have to copy the checkpoints directory located in /var/lib/docker/containers/<container_id>/checkpoints.
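A minimal sketch of that copy step, assuming the directory layout described above. `copy_checkpoint` and its arguments are hypothetical names for illustration; the containers root is passed in rather than hard-coded, and the container IDs must be the full IDs used as directory names.

```shell
# Hypothetical helper: copy one checkpoint from a source container's
# checkpoints directory into a destination container's checkpoints directory.
# $1 = containers root (the answer gives /var/lib/docker/containers)
# $2 = full source container ID
# $3 = full destination container ID
# $4 = checkpoint name
copy_checkpoint() {
    src="$1/$2/checkpoints/$4"
    dst_dir="$1/$3/checkpoints"
    [ -d "$src" ] || { echo "no such checkpoint: $src" >&2; return 1; }
    # -R copies the directory tree, -p preserves modes and timestamps.
    mkdir -p "$dst_dir" && cp -Rp "$src" "$dst_dir/"
}

# Usage (typically needs root, since the directory belongs to the daemon):
#   copy_checkpoint /var/lib/docker/containers <src_container_id> <dst_container_id> <checkpoint_name>
```

After the copy, the restore command shown above would be run against the destination container.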

Ross Boucher
  • Thanks. So, I compiled the docker-checkpoint-restore using the same [instructions from SaiedKazemi](http://stackoverflow.com/questions/34278619/docker-suspend-and-resume-using-criu/34318665). I also had to clone containerd and compile it to create `containerd`, `containerd-shim` and `ctr`. I also had to install `runc` (I used apt-get install to do so). Invocation of dockerd using `sudo dockerd` went OK, but only if I created symbolic links `docker-containerd` and `docker-containerd-shim`. After that, `docker run` complained about not finding `docker-runc` - resolved that using symbolic links – userVK Jun 14 '16 at 02:47
  • Finally, I am able to invoke `docker run`, but it fails - complaining about `shim error: open pid: no such file or directory`. There are a couple of posts on it on [github docker](https://github.com/docker/docker/issues/23467), but no clear resolutions. I will try to track the problem down. – userVK Jun 14 '16 at 02:52
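The symbolic-link workaround described in the comments above could be sketched roughly as follows. `link_docker_names` is a hypothetical helper name, and the link directory is an assumption; the only detail taken from the comments is that the daemon looked for `docker-`-prefixed binary names.

```shell
# Hypothetical helper: for each named binary found on PATH, create a
# "docker-<name>" symlink to it inside the given directory.
# $1 = directory to place the links in (should itself be on PATH)
# remaining args = binary names, e.g. containerd containerd-shim runc
link_docker_names() {
    link_dir="$1"; shift
    for name in "$@"; do
        target=$(command -v "$name") || { echo "not found: $name" >&2; return 1; }
        ln -sf "$target" "$link_dir/docker-$name"
    done
}

# Usage (assumes the binaries are already installed and the directory is writable):
#   link_docker_names /usr/local/bin containerd containerd-shim runc
```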