
Restarting a Windows server that is a swarm worker causes Windows containers to get stuck in a "Preparing" state indefinitely once the server and the Docker daemon are back online.

Image of tasks/containers stuck in preparing state:

https://user-images.githubusercontent.com/4528753/65180353-4e5d6e80-da22-11e9-8060-451150865177.png

Steps to reproduce the issue:

  1. Create a swarm (in my case, CentOS 7 managers and a few Windows Server 1903 workers)
  2. Create a "global" Docker service that runs only on the Windows machines. The tasks start up and work fine initially.
  3. Drain one or more of the Windows nodes that are running the container(s) from step 2 (docker node update --availability=drain nodename)
  4. Restart one or more of the nodes drained in step 3 and wait for them to come back online
  5. Set the Windows node(s) back to active (docker node update --availability=active nodename)
  6. Observe that the service from step 2 now shows its tasks as "Preparing" on those nodes, and they stay there indefinitely (docker service ps servicename --no-trunc, run from any manager node); the full command sequence is sketched below
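
For reference, this is roughly the sequence run from a manager node; the service name, image, and node name below are placeholders, and the placement constraint is just one way to keep the service on the Windows workers:

# Create a global service that only lands on Windows nodes (example name/image).
docker service create --name winsvc --mode global --constraint node.platform.os==windows mcr.microsoft.com/windows/servercore:1903 ping -t localhost

# Drain the Windows worker so its task shuts down, then reboot the worker.
docker node update --availability=drain SWARMWORKER1

# Once the worker is back online, reactivate it.
docker node update --availability=active SWARMWORKER1

# The replacement task for that node sits in "Preparing" indefinitely.
docker service ps winsvc --no-trunc

While the task sits in "Preparing", the Docker daemon log on the affected worker (10.60.3.40) fills with errors like these: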
memberlist: Refuting a suspect message (from: c9347e85405d)
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
agent: session failed [node.id=wuhifvg9li3v5zuq2xu7c6hxa module=node/agent error=rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.60.3.69:2377: connectex: A socket operation was attempted to an unreachable host." backoff=6.3s]
Failed to send gossip to 10.60.3.110: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.186: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.186: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.109: write udp 10.60.3.40:7946->10.60.3.109:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.110: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
memberlist: Failed to send gossip to 10.60.3.105:7946: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
memberlist: Failed to send gossip to 10.60.3.186:7946: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.

Many of these errors are odd. For example, port 7946 is completely open between the cluster nodes; telnet checks confirm this.
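
For reference, this is roughly how I check those ports from the worker (one manager's address shown as an example); note that telnet and Test-NetConnection only exercise TCP, while the failing gossip sends above are UDP on 7946:

# From the Windows worker: verify the swarm manager ports are reachable over TCP.
Test-NetConnection -ComputerName 10.60.3.110 -Port 2377   # cluster management
Test-NetConnection -ComputerName 10.60.3.110 -Port 7946   # gossip (UDP 7946 is also used, which this does not test)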

I expect the service containers to start promptly rather than sit in a Preparing state. The image is already pulled, so startup should be fast.

docker version output

Client: Docker Engine - Enterprise
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.8
 Git commit:        c92ab06ed9
 Built:             09/03/2019 16:38:11
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.12.8
  Git commit:       c92ab06ed9
  Built:            09/03/2019 16:35:47
  OS/Arch:          windows/amd64
  Experimental:     false

docker info output

Client:
 Debug Mode: false
 Plugins:
  cluster: Manage Docker clusters (Docker Inc., v1.1.0-8c33de7)

Server:
 Containers: 4
  Running: 0
  Paused: 0
  Stopped: 4
 Images: 4
 Server Version: 19.03.2
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics l2bridge l2tunnel nat null overlay transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: active
  NodeID: wuhifvg9li3v5zuq2xu7c6hxa
  Is Manager: false
  Node Address: 10.60.3.40
  Manager Addresses:
   10.60.3.110:2377
   10.60.3.186:2377
   10.60.3.69:2377
 Default Isolation: process
 Kernel Version: 10.0 18362 (18362.1.amd64fre.19h1_release.190318-1202)
 Operating System: Windows Server Datacenter Version 1903 (OS Build 18362.356)
 OSType: windows
 Architecture: x86_64
 CPUs: 4
 Total Memory: 8GiB
 Name: SWARMWORKER1
 ID: V2WJ:OEUM:7TUQ:WPIO:UOK4:IAHA:KWMN:RQFF:CAUO:LUB6:DJIJ:OVBX
 Docker Root Dir: E:\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: this node is not a swarm manager - check license status on a manager node

Additional Details

  • These nodes are not using Docker Desktop for Windows. I provisioned Docker on each box based primarily on the PowerShell instructions here (roughly the commands sketched after this list): https://docs.docker.com/install/windows/docker-ee/
  • Windows Firewall is disabled
  • iptables/firewalld is disabled
  • Communication is completely open between the cluster nodes
  • The nodes are fully up to date on Windows cumulative updates
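
From memory, the provisioning boils down to roughly the following (the exact commands are in the guide linked above; in my case the daemon's data directory is also moved to E:\docker):

# Install the DockerMsftProvider module and the Docker EE package, then reboot.
Install-Module DockerMsftProvider -Force
Install-Package Docker -ProviderName DockerMsftProvider -Force
Restart-Computer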

I posted this on the moby repo's issue tracker but never heard a peep: https://github.com/moby/moby/issues/39955

The ONLY way I've found to temporarily fix the issue is to drain the node, delete the Docker data files, reinstall the Windows "Containers" feature, and then rejoin the swarm (a rough sketch of the sequence is below). But it happens again on the next reboot.
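
Roughly, that workaround looks like the following; the node name, manager address, and join token are placeholders, and the exact rejoin command comes from docker swarm join-token worker on a manager:

# On a manager: drain the broken worker and remove it from the swarm.
docker node update --availability=drain SWARMWORKER1
docker node rm --force SWARMWORKER1

# On the worker: leave the swarm, stop Docker, and wipe its data directory.
docker swarm leave
Stop-Service docker
Remove-Item -Recurse -Force E:\docker

# Reinstall the Containers feature (a reboot between removal and reinstall may be required).
Uninstall-WindowsFeature Containers
Install-WindowsFeature Containers
Restart-Computer

# After the reboot: start Docker and rejoin the swarm as a worker.
Start-Service docker
docker swarm join --token <worker-join-token> 10.60.3.110:2377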

What's interesting is that when a swarm task is in the "Preparing" state on the Windows worker, the server doesn't appear to be doing anything at all; it's as if the manager thinks the worker is preparing the container, but it isn't.
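
For example, while a task is shown as "Preparing" for this node, nothing is visible on the worker itself (illustrative checks):

# On the worker: no container has been created for the stuck task.
docker ps -a

# No create/pull events appear either (this streams; Ctrl+C to stop).
docker events --since 30m

# The daemon itself is essentially idle.
Get-Process dockerd | Select-Object Name, CPU, WorkingSet64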

Does anyone have any suggestions?

  • Did you try to restart the Docker service when it happens? – rok Oct 10 '19 at 13:11
  • Yeah, I mean, a node drain/active will cause the problem. A node reboot will cause it, a docker service reboot causes it. Very strange. – drewster Oct 11 '19 at 15:41

0 Answers