I have a one-node swarm (A) running a couple of long-running services. I recently added a new node (B), initially as a worker, and then promoted it to a manager. However, the swarm is now broken and Docker on A simply will not restart. I recovered the newly added node B by deleting /var/lib/docker on it.

The issue began when I ran this on node B:

# docker service create --name storage-2 --hostname minio-2.media-server --constraint 'node.labels.storage==true' --publish published=9020,target=9000 --replicas 1 minio/minio server /mnt/ssd-2/minio-2.media-server/storage/
owqwx29gtr5v2ozjrarmf3t08
overall progress: 0 out of 1 tasks 
1/1:   
Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
root@media-server:~# docker service ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
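
For reference, a couple of commands that should still report the local swarm state even when there is no leader (as far as I understand, docker info only queries the local daemon, while docker node ls needs a reachable leader and fails with the same "no leader" error):

# local daemon's view of the swarm; works without quorum
docker info --format 'state: {{.Swarm.LocalNodeState}}, manager: {{.Swarm.ControlAvailable}}'
# needs a leader, so it fails here
docker node ls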

I tried rejoining node B to A's swarm by reapplying the original join token, but I just get this response:

Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.1.148:2377: connect: connection refused"
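
For completeness, this is roughly the sequence I would expect to work for rejoining B under normal circumstances; it obviously cannot succeed here because the manager at 192.168.1.148:2377 is down, and <worker-token> is just a placeholder:

# on a healthy manager: print a fresh worker join token
docker swarm join-token worker
# on node B: drop any stale swarm state, then join with the printed token
docker swarm leave --force
docker swarm join --token <worker-token> 192.168.1.148:2377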

Now on to node A. This is the daemon.json on that node:

# cat /etc/docker/daemon.json
{
 "runtimes": {
   "nvidia": {
     "path": "/usr/bin/nvidia-container-runtime",
     "runtimeArgs": []
   }
 },
 "default-runtime": "nvidia",
 "userns-remap": "my-user",
 "data-root": "/opt/var/docker/"
}

When I run sudo systemctl restart docker I get the typical Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

If I run dockerd --debug I get the following:

time="2020-01-07T16:07:47.940589246-05:00" level=info msg="User namespaces: ID ranges will be mapped to subuid/subgid ranges of: my-user:my-user"
time="2020-01-07T16:07:47.940649613-05:00" level=debug msg="Creating user namespaced daemon root: /opt/var/docker/1000.999"
time="2020-01-07T16:07:47.940969396-05:00" level=debug msg="Listener created for HTTP on unix (/var/run/docker.sock)"
time="2020-01-07T16:07:47.941193653-05:00" level=info msg="libcontainerd: started new docker-containerd process" pid=3731
time="2020-01-07T16:07:47-05:00" level=info msg="starting containerd" module=containerd revision=773c489c9c1b21a6d78b5c538cd395416ec50f88 version=v1.0.3 
time="2020-01-07T16:07:47-05:00" level=debug msg="changing OOM score to -500" module=containerd 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.content.v1.content"..." module=containerd type=io.containerd.content.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." module=containerd type=io.containerd.snapshotter.v1 
time="2020-01-07T16:07:47-05:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.btrfs" error="path /opt/var/docker/1000.999/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter" module=containerd 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." module=containerd type=io.containerd.snapshotter.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." module=containerd type=io.containerd.metadata.v1 
time="2020-01-07T16:07:47-05:00" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /opt/var/docker/1000.999/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter" module="containerd/io.containerd.metadata.v1.bolt" 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.differ.v1.walking"..." module=containerd type=io.containerd.differ.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.gc.v1.scheduler"..." module=containerd type=io.containerd.gc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." module=containerd type=io.containerd.monitor.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." module=containerd type=io.containerd.runtime.v1 
time="2020-01-07T16:07:47-05:00" level=debug msg="loading tasks in namespace" module="containerd/io.containerd.runtime.v1.linux" namespace=plugins.moby 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." module=containerd type=io.containerd.grpc.v1 
time="2020-01-07T16:07:47-05:00" level=info msg=serving... address="/var/run/docker/containerd/docker-containerd-debug.sock" module="containerd/debug" 
time="2020-01-07T16:07:47-05:00" level=info msg=serving... address="/var/run/docker/containerd/docker-containerd.sock" module="containerd/grpc" 
time="2020-01-07T16:07:47-05:00" level=info msg="containerd successfully booted in 0.002536s" module=containerd 
time="2020-01-07T16:07:47.951820862-05:00" level=info msg="User namespaces: ID ranges will be mapped to subuid/subgid ranges of: my-user:my-user"
time="2020-01-07T16:07:47.951973737-05:00" level=debug msg="Golang's threads limit set to 230760"
time="2020-01-07T16:07:47.952395645-05:00" level=debug msg="processing event stream" module=libcontainerd namespace=plugins.moby
time="2020-01-07T16:07:47.961837504-05:00" level=debug msg="restored container" alive=true container=36d74b0ce14aef040e5994573e42909679ce20e6d8c7e21c9a5ccfa82b9ef38c module=libcontainerd namespace=plugins.moby pid=1966
time="2020-01-07T16:07:47.993831037-05:00" level=error msg="failed to enable plugin 'vieux/sshfs:latest': error creating containerd container: id already in use"
time="2020-01-07T16:07:47.993929990-05:00" level=debug msg="Using default logging driver json-file"
time="2020-01-07T16:07:47.994191355-05:00" level=debug msg="[graphdriver] priority list: [btrfs zfs overlay2 aufs overlay devicemapper vfs]"
time="2020-01-07T16:07:48.003990866-05:00" level=debug msg="backingFs=extfs,  projectQuotaSupported=false"
time="2020-01-07T16:07:48.004000013-05:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
time="2020-01-07T16:07:48.004006714-05:00" level=debug msg="Initialized graph driver overlay2"
time="2020-01-07T16:07:48.015593752-05:00" level=debug msg="Failed to load mount 074d0e7ec2bbf9e234b77b9f94e5bdcb2d506be83937f4704d5546606e886f00: invalid mount id value"
time="2020-01-07T16:07:48.015693673-05:00" level=debug msg="Failed to load mount 0f4801970ae34e1b3573f3361fdb30e1ae1d0ff9909c4d938a8ac366a8f13162: invalid mount id value"
time="2020-01-07T16:07:48.015838872-05:00" level=debug msg="Failed to load mount 17d3e93832d5f14b6a64dafff6ace99aa823f12c5d014697cd96257269caf49f: invalid mount id value"
time="2020-01-07T16:07:48.015852094-05:00" level=debug msg="Failed to load mount 19a2396129ae436e154ab21f8a55933918e742381db7cf7cfd8263398afdeb4c: invalid mount id value"
time="2020-01-07T16:07:48.015889445-05:00" level=debug msg="Failed to load mount 1aca7d8b62b97b368c34bf01a428c6b45d37f28d8592e5e106f613c805ef4801: invalid mount id value"
time="2020-01-07T16:07:48.015901494-05:00" level=debug msg="Failed to load mount 1cf5677cd23bdfd158b2890f5e4e3b0a33dfe0215676ffc3fb36b261223567af: invalid mount id value"
time="2020-01-07T16:07:48.015966443-05:00" level=debug msg="Failed to load mount 2b7f467b2d9afad7399456a8bb3a5dd74ee1ed6dee7bc4bb7feaeda313d0fce5: invalid mount id value"
time="2020-01-07T16:07:48.016134312-05:00" level=debug msg="Failed to load mount 3b2a3257406cb76615cc0f33698a3a5d2332e91358b526e7858883ebb00054f2: invalid mount id value"
time="2020-01-07T16:07:48.016146780-05:00" level=debug msg="Failed to load mount 449ef233ea403394a1e9ea8d4deb1a74a966dd44256f1dd6833d14853fc3d2a9: invalid mount id value"
time="2020-01-07T16:07:48.016158344-05:00" level=debug msg="Failed to load mount 45369784a3382c4a520f478b8a71a148b79c7d01ad56d1e6af58ed4b17517dcf: invalid mount id value"
time="2020-01-07T16:07:48.016287912-05:00" level=debug msg="Failed to load mount 5b14ff5fe4cdaaa8f8b491d227bf7b4354b02a9b13bba05d9da880b4bc617b96: invalid checksum digest format"
time="2020-01-07T16:07:48.016327990-05:00" level=debug msg="Failed to load mount 6719acf4f4ec553baa73b30a33d1bab4fa7054d6d33d8f28b74e9c466fcfea81: invalid mount id value"
time="2020-01-07T16:07:48.016344990-05:00" level=debug msg="Failed to load mount 6e14a906fca9c1448199bb3626c66057609d7129d6fa5b365dcf759ecf4e4b0d: invalid mount id value"
time="2020-01-07T16:07:48.016538443-05:00" level=debug msg="Failed to load mount 824070e6c6437dcd6cb7dba111e6b294f8cbb477386165cf6f2aae18ec716cb5: invalid mount id value"
time="2020-01-07T16:07:48.016628599-05:00" level=debug msg="Failed to load mount 8acc1c7d16433ff26ae83ffa5a390ef76d731136501085c6ce66fa1ebf10c595: invalid mount id value"
time="2020-01-07T16:07:48.016868507-05:00" level=debug msg="Failed to load mount 9d771c8e19874f77d617335c7418baa5912a090e1ed5f208ba50faaea1f4e6bd: invalid mount id value"
time="2020-01-07T16:07:48.016882377-05:00" level=debug msg="Failed to load mount 9de56a13c2b9d5e2e0c167bba5e95a69b1181f29c00e2a0502867afa0d89a7bf: invalid mount id value"
time="2020-01-07T16:07:48.017022459-05:00" level=debug msg="Failed to load mount a504507cb86ddd1ea10ff86bd61a3b203fb19a0d1e66a1ea443e41b125a37850: invalid mount id value"
time="2020-01-07T16:07:48.017459238-05:00" level=debug msg="Failed to load mount d0499c8ceb86c9a06e6dcacadbe94bad3a9bd3d31a2ffe1f973e14df6eb3b970: invalid mount id value"
time="2020-01-07T16:07:48.018010790-05:00" level=debug msg="Max Concurrent Downloads: 3"
time="2020-01-07T16:07:48.018016142-05:00" level=debug msg="Max Concurrent Uploads: 5"
time="2020-01-07T16:07:48-05:00" level=debug msg="garbage collected" d=1.399144ms module="containerd/io.containerd.gc.v1.scheduler" 
time="2020-01-07T16:07:48.083758903-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2020-01-07T16:07:48.083902282-05:00" level=warning msg="Your kernel does not support swap memory limit"
time="2020-01-07T16:07:48.083925855-05:00" level=warning msg="Your kernel does not support cgroup rt period"
time="2020-01-07T16:07:48.083931706-05:00" level=warning msg="Your kernel does not support cgroup rt runtime"
time="2020-01-07T16:07:48.084397055-05:00" level=info msg="Loading containers: start."
time="2020-01-07T16:07:48.084582669-05:00" level=debug msg="processing event stream" module=libcontainerd namespace=moby

... hundreds of similar lines follow ...

time="2020-01-07T16:07:48.110453451-05:00" level=debug msg="Loaded container ffc8dc10f49a4555b9da0479b071e1da55882d73ebb78dea35f601edf8d4db2a, isRunning: false"
time="2020-01-07T16:07:49.562933661-05:00" level=debug msg="restoring container" container=8c74783e68992ed9f6ca0ecb37eec49bcc17c0069c5e7fc73cf10dd78ce57e87 paused=false running=false
time="2020-01-07T16:07:49.562959806-05:00" level=debug msg="restoring container" container=a63d3e14ecaed0de71e3ca874d3946ea5379bf000ac5c97abb0e99be1d057359 paused=false running=false
time="2020-01-07T16:07:49.562967753-05:00" level=debug msg="restoring container" container=a078311c21fbe7fca358acaade5346bf32c642519bbd4b457f5c21d05d87c508 paused=false running=false
time="2020-01-07T16:07:49.563120965-05:00" level=debug msg="restoring container" container=33f9cc2d2b5feabbfdf10888f30d9a63e72fe9d04cc0895635bef974c8f73206 paused=false running=false
time="2020-01-07T16:07:49.581404277-05:00" level=debug msg="restoring container" container=07b7f22f20eeecb9b3ed548aaa931ffaa8ca4a37094b34061d9ac41a35574f74 paused=false running=false



... hundreds of similar lines follow ...

time="2020-01-07T16:07:49.733381464-05:00" level=debug msg="restoring container" container=4f4f97ff30217dc80916eebc9c100c6d799a6a035a73c5c01990062368253ebf paused=false running=false
time="2020-01-07T16:07:49.741439388-05:00" level=debug msg="restoring container" container=39c168eb717931505c00852b38233f4f6bfc3c2b61549e1cabc212a67a729686 paused=false running=false
time="2020-01-07T16:07:49.741561635-05:00" level=debug msg="restoring container" container=b8041f2cf71dd5939ebf91a9a23ba02b4d65915e7e2160bc32391820e320602d paused=false running=false
time="2020-01-07T16:07:49.742282264-05:00" level=debug msg="Option Experimental: false"
time="2020-01-07T16:07:49.742323616-05:00" level=debug msg="Option DefaultDriver: bridge"
time="2020-01-07T16:07:49.742342572-05:00" level=debug msg="Option DefaultNetwork: bridge"
time="2020-01-07T16:07:49.742360955-05:00" level=debug msg="Network Control Plane MTU: 1500"
time="2020-01-07T16:07:49.760484779-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-ISOLATION]"
time="2020-01-07T16:07:49.762934318-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -D PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]"
time="2020-01-07T16:07:49.765664342-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL ! --dst 127.0.0.0/8 -j DOCKER]"
time="2020-01-07T16:07:49.768232163-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]"
time="2020-01-07T16:07:49.770920179-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -D PREROUTING]"
time="2020-01-07T16:07:49.773264508-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -D OUTPUT]"

... many lines of iptables/IP routing output follow ...

time="2020-01-07T16:07:50.011816215-05:00" level=debug msg="releasing IPv4 pools from network bridge (fbb914a007fc5a6fb39c1478503309bb03dea56479854613649657c528a44720)"
time="2020-01-07T16:07:50.011859810-05:00" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.1)"
time="2020-01-07T16:07:50.011894185-05:00" level=debug msg="Received set for ordinal 1, start 0, end 0, any false, release true, serial:false curr:0 \n"
time="2020-01-07T16:07:50.011917547-05:00" level=debug msg="ReleasePool(LocalDefault/172.17.0.0/16)"
time="2020-01-07T16:07:50.013635623-05:00" level=debug msg="cleanupServiceDiscovery for network:fbb914a007fc5a6fb39c1478503309bb03dea56479854613649657c528a44720"
time="2020-01-07T16:07:50.013670741-05:00" level=debug msg="cleanupServiceBindings for fbb914a007fc5a6fb39c1478503309bb03dea56479854613649657c528a44720"
time="2020-01-07T16:07:50.016796612-05:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2020-01-07T16:07:50.016872068-05:00" level=debug msg="Allocating IPv4 pools for network bridge (04075f9c3581691ac86f78e25f4432eaff14e91ecd69eb1ce93240886b78e2fe)"
time="2020-01-07T16:07:50.016902902-05:00" level=debug msg="RequestPool(LocalDefault, 172.17.0.0/16, , map[], false)"
time="2020-01-07T16:07:50.016940532-05:00" level=debug msg="Received set for ordinal 0, start 0, end 0, any false, release false, serial:false curr:0 \n"
time="2020-01-07T16:07:50.016966550-05:00" level=debug msg="Received set for ordinal 65535, start 0, end 0, any false, release false, serial:false curr:0 \n"
time="2020-01-07T16:07:50.017002299-05:00" level=debug msg="RequestAddress(LocalDefault/172.17.0.0/16, 172.17.0.1, map[RequestAddressType:com.docker.network.gateway])"
time="2020-01-07T16:07:50.017035822-05:00" level=debug msg="Received set for ordinal 1, start 0, end 0, any false, release false, serial:false curr:0 \n"
time="2020-01-07T16:07:50.017381142-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE]"
time="2020-01-07T16:07:50.019760527-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -I POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE]"
time="2020-01-07T16:07:50.022422442-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -C DOCKER -i docker0 -j RETURN]"
time="2020-01-07T16:07:50.024866360-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -I DOCKER -i docker0 -j RETURN]"
time="2020-01-07T16:07:50.027363703-05:00" level=debug msg="/sbin/iptables, [--wait -D FORWARD -i docker0 -o docker0 -j DROP]"
time="2020-01-07T16:07:50.029951287-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 -o docker0 -j ACCEPT]"
time="2020-01-07T16:07:50.032367948-05:00" level=debug msg="/sbin/iptables, [--wait -I FORWARD -i docker0 -o docker0 -j ACCEPT]"
time="2020-01-07T16:07:50.034909387-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 ! -o docker0 -j ACCEPT]"
time="2020-01-07T16:07:50.037318217-05:00" level=debug msg="/sbin/iptables, [--wait -I FORWARD -i docker0 ! -o docker0 -j ACCEPT]"
time="2020-01-07T16:07:50.039875811-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]"
time="2020-01-07T16:07:50.042530975-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]"
time="2020-01-07T16:07:50.045230805-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER ! --dst 127.0.0.0/8]"
time="2020-01-07T16:07:50.047785924-05:00" level=debug msg="/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER ! --dst 127.0.0.0/8]"
time="2020-01-07T16:07:50.050394918-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -j DOCKER]"
time="2020-01-07T16:07:50.052717342-05:00" level=debug msg="/sbin/iptables, [--wait -I FORWARD -o docker0 -j DOCKER]"
time="2020-01-07T16:07:50.062268276-05:00" level=debug msg="/sbin/iptables, [--wait -D FORWARD -j DOCKER-ISOLATION-STAGE-1]"
time="2020-01-07T16:07:50.064657407-05:00" level=debug msg="/sbin/iptables, [--wait -I FORWARD -j DOCKER-ISOLATION-STAGE-1]"
time="2020-01-07T16:07:50.066993453-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2]"
time="2020-01-07T16:07:50.069279455-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -I DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2]"
time="2020-01-07T16:07:50.071533312-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP]"
time="2020-01-07T16:07:50.073783603-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -I DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP]"
time="2020-01-07T16:07:50.081163012-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -n -L DOCKER-USER]"
time="2020-01-07T16:07:50.083363010-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-USER -j RETURN]"
time="2020-01-07T16:07:50.085669384-05:00" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-USER]"
time="2020-01-07T16:07:50.087828138-05:00" level=debug msg="/sbin/iptables, [--wait -D FORWARD -j DOCKER-USER]"
time="2020-01-07T16:07:50.090030604-05:00" level=debug msg="/sbin/iptables, [--wait -I FORWARD -j DOCKER-USER]"
time="2020-01-07T16:07:50.809775582-05:00" level=info msg="Loading containers: done."
time="2020-01-07T16:07:50.955485174-05:00" level=info msg="Docker daemon" commit=9ee9f40 graphdriver(s)=overlay2 version=18.03.1-ce
time="2020-01-07T16:07:50.967234275-05:00" level=debug msg="loaded node credentials" module=node/tls node.id=k55prdk5rxorfnwu5j6xpjpp1 node.role=swarm-manager
time="2020-01-07T16:07:50.974618962-05:00" level=debug msg="(*Agent).run" module=node/agent node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:50.974758665-05:00" level=debug msg="next certificate renewal scheduled for 691h39m9.025246469s from now" module=node/tls node.id=k55prdk5rxorfnwu5j6xpjpp1 node.role=swarm-manager time="2020-02-05 11:47:00.000000137 -0500 EST m=+2489952.067407376"
time="2020-01-07T16:07:50.976189332-05:00" level=info msg="Listening for connections" addr="[::]:2377" module=node node.id=k55prdk5rxorfnwu5j6xpjpp1 proto=tcp
time="2020-01-07T16:07:50.976195878-05:00" level=info msg="Listening for local connections" addr=/var/run/docker/swarm/control.sock module=node node.id=k55prdk5rxorfnwu5j6xpjpp1 proto=unix
2020-01-07 16:07:50.976335 W | wal: ignored file 0000000000000014-00000000000cdd34.wal.broken in wal
time="2020-01-07T16:07:51.046891166-05:00" level=info msg="20ef9e0a7b6041dc became follower at term 117" module=raft node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:51.046928313-05:00" level=info msg="newRaft 20ef9e0a7b6041dc [peers: [20ef9e0a7b6041dc], term: 117, commit: 2194765, applied: 2193235, lastindex: 2194765, lastterm: 117]" module=raft node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:51.050413757-05:00" level=info msg="manager selected by agent for new session: {k55prdk5rxorfnwu5j6xpjpp1 192.168.1.148:2377}" module=node/agent node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:51.050458302-05:00" level=error msg="error removing task " error="bucket not found" module=node/agent/worker node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:51.050469991-05:00" level=error msg="error removing task " error="bucket not found" module=node/agent/worker node.id=k55prdk5rxorfnwu5j6xpjpp1



... many hundreds of similar debug lines follow ...


time="2020-01-07T16:07:51.067790438-05:00" level=error msg="error removing task " error="bucket not found" module=node/agent/worker node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:51.067801184-05:00" level=error msg="error removing task " error="bucket not found" module=node/agent/worker node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:51.067809512-05:00" level=error msg="error removing task " error="bucket not found" module=node/agent/worker node.id=k55prdk5rxorfnwu5j6xpjpp1
time="2020-01-07T16:07:51.089348209-05:00" level=debug msg="(*Agent).UpdateTaskStatus" module=node/agent node.id=k55prdk5rxorfnwu5j6xpjpp1 task.id=0o21qh02tutcemqcxp9a1pnfp
time="2020-01-07T16:07:51.169918243-05:00" level=debug msg="transport: add peer 5db22c04be0bcbfc with address 192.168.1.113:2377"
time="2020-01-07T16:07:51.174166324-05:00" level=debug msg="stop transport"
panic: Copy for *types.BoolValue not implemented

goroutine 499 [running]:
github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy.Copy(0x55a4645b9e20, 0xc422b78ff8, 0x55a4645b9e20, 0xc422b78fd6)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy/copy.go:50 +0x3cb
github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*ContainerSpec).CopyFrom(0xc421b2c2c0, 0x55a46457e100, 0xc421b2c160)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/specs.pb.go:1134 +0x957
github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy.Copy(0x55a46457e100, 0xc421b2c2c0, 0x55a46457e100, 0xc421b2c160)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy/copy.go:48 +0x251
github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*TaskSpec).CopyFrom(0xc420956058, 0x55a464607460, 0xc420959d18)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/specs.pb.go:1026 +0x277
github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy.Copy(0x55a464607460, 0xc420956058, 0x55a464607460, 0xc420959d18)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy/copy.go:48 +0x251
github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*ServiceSpec).CopyFrom(0xc420956028, 0x55a4645c5940, 0xc420959ce8)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/specs.pb.go:905 +0xd4
github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy.Copy(0x55a4645c5940, 0xc420956028, 0x55a4645c5940, 0xc420959ce8)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/deepcopy/copy.go:48 +0x251
github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*Service).CopyFrom(0xc420956000, 0x55a464619560, 0xc420959cc0)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/objects.pb.go:435 +0xcd
github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*Service).Copy(0xc420959cc0, 0x19)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/objects.pb.go:426 +0x62
github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*Service).CopyStoreObject(0xc420959cc0, 0x55a463bad7c9, 0x7)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/objects.pb.go:2323 +0x2d
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store.(*tx).create(0xc421df22d0, 0x55a463bad7c9, 0x7, 0x55a465165ac0, 0xc420959cc0, 0xc422b78fa7, 0x9)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store/memory.go:557 +0xca
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store.CreateService(0x55a465164020, 0xc421df22d0, 0xc420959cc0, 0x28, 0xc421df22d0)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store/services.go:93 +0xfb
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store.init.8.func3(0x55a465164020, 0xc421df22d0, 0xc400000001, 0x55a4651583a0, 0xc422596b70, 0x55a46513c2a0, 0xc420400320)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store/services.go:73 +0x12a
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store.applyStoreAction(0x55a465164020, 0xc421df22d0, 0xc400000001, 0x55a4651583a0, 0xc422596b70, 0xc4221a3350, 0x150)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store/memory.go:310 +0xa9
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store.(*MemoryStore).ApplyStoreActions(0xc4202b10c0, 0xc421b2e9c0, 0x1, 0x1, 0xc421b2e900, 0x0)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/store/memory.go:289 +0x165
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft.(*Node).processEntry(0xc420424a00, 0x7f371c8c35c0, 0xc420ff87c0, 0x75, 0x217d4d, 0x0, 0xc421646160, 0x150, 0x150, 0x0, ...)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/raft.go:1906 +0x131
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft.(*Node).processCommitted(0xc420424a00, 0x7f371c8c35c0, 0xc420ff87c0, 0x75, 0x217d4d, 0x0, 0xc421646160, 0x150, 0x150, 0x0, ...)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/raft.go:1867 +0x139
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft.(*Node).Run(0xc420424a00, 0x7f371c8c35c0, 0xc420ff87c0, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/state/raft/raft.go:655 +0xfbd
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).Run.func6(0xc4202a3d40, 0x7f371c8c35c0, 0xc4202b1900)
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:546 +0x4c
created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).Run
    /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:545 +0x167e

Is there any way to recover the swarm on node A? I am not sure how to proceed.

# docker version
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
lurscher

1 Answer

I'm afraid the situation you're in is quite critical.

To understand why this happened, you need to know how Docker Swarm works and how the Raft consensus algorithm works.

Swarm uses Raft to manage all the managers in your cluster. Here is a very quick summary of what that means: each time you add or remove a manager, an election takes place among all the managers to elect a leader. The node that first receives a majority of the votes becomes the leader.

Now, in your case, you had one manager, which was the leader. You added another one, so a new election took place and one of your nodes became the new leader. That left you with only two managers, which is a really bad configuration.
You then lost node B from the cluster (probably when you deleted /var/lib/docker), and since you just lost a manager, a new election is triggered. Node A nominates itself as leader, but the cluster still thinks there are two managers, so node A waits for a vote from the other node. As you can guess, that vote will never come because B is down, and the cluster stays down as long as there is no leader.
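
To make the majority rule concrete, here is a quick shell sketch of the quorum arithmetic (plain Raft math, not a Docker command):

# Raft needs a strict majority: quorum = floor(N/2) + 1,
# so N managers tolerate floor((N-1)/2) failures.
for n in 1 2 3 5 7; do
  echo "$n manager(s): quorum $(( n / 2 + 1 )), tolerates $(( (n - 1) / 2 )) failure(s)"
done

With two managers the quorum is 2 and the fault tolerance is 0, which is exactly the trap you fell into.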

That is (partially) why you need an odd number of managers and why having two of them is a bad idea. In fact, Docker even warns you about this indirectly:

You can use the --force option on a manager to remove it from the swarm. However, this does not reconfigure the swarm to ensure that there are enough managers to maintain a quorum in the swarm.

Here, using --force or losing the manager node results in the same situation.
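
For future reference, the safe way to take a manager out of a swarm is to demote it first, so the remaining managers reconfigure the quorum before the node disappears. A sketch, with <node-b> standing in for the node name:

# on a remaining manager: demote B back to a plain worker
docker node demote <node-b>
# on node B itself: leave the swarm cleanly
docker swarm leave
# back on a manager: remove the (now down) node from the node list
docker node rm <node-b>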

So I don't really know if there is a solution to your problem, but I suspect there isn't one (I had the same issue and had to reinstall the daemon).

Hope this helps you in the future!

Marc ABOUCHACRA
  • hi Marc, thank you so much for sharing your experience. Actually I think node A won the election before the failure, as `docker node ls` showed A as `Leader` and node B as `Reachable`. – lurscher Jan 10 '20 at 15:17
  • Even if node A won the election, it was before node B's failure. The moment node B failed, a new election was triggered. – Marc ABOUCHACRA Jan 10 '20 at 15:26
  • seems a weird protocol; one would expect an election to proceed only with reachable nodes... oh well, at least I have not-too-old backups – lurscher Jan 10 '20 at 15:32