
I had set up a working Docker Swarm cluster, but after several months away I am trying to use it again and noticed that nothing works.

While troubleshooting, I found this error:

 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: error
  NodeID: 
  Error: error while loading TLS certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: certificate (1 - s3htdkgcv9qifg2jmbpud1gt7) not valid after Sun, 27 Mar 2022 10:27:00 UTC, and it is currently Sun, 19 Jun 2022 04:33:54 UTC: x509: certificate has expired or is not yet valid: 
  Is Manager: false
  Node Address: 10.10.1.10

I have tried what I found online, such as this answer: https://stackoverflow.com/a/59086699/5442187

docker swarm leave

and then tried to rejoin

docker swarm join-token manager

=>

Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

And

docker swarm join-token worker

=>

Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

How do I rejoin or reclaim this cluster? I would expect this to be possible; otherwise Docker Swarm is a no-go for production.

Mark Rotteveel
uberrebu
  • `join` <> `join-token`, did you mean `docker swarm join --token ...`? – Mark Rotteveel Jun 19 '22 at 10:16
  • normally that command prints the value of token to then use in next command `docker swarm join --token ...` else no way to know value of token to use to recover the cluster – uberrebu Jun 19 '22 at 10:29
  • To be clear, I'm only going by the error message itself. I haven't used Docker Swarm. However, I assume that if you want to get the token, you need to enter that command on a node that is still in the swarm, not on the node that has left the swarm/is not in the swarm. – Mark Rotteveel Jun 19 '22 at 10:31
  • Do you have a separate swarm manager that's working? Please include the list of nodes in your cluster, role of each node, which have errors, and where you're running commands. – BMitch Jun 19 '22 at 11:52
  • there were just 2 nodes in the cluster and both of them say manager false; commands were run on both nodes and none of them work – uberrebu Jun 19 '22 at 18:15
  • Were the nodes off for an extended period of time? In the normal course of events in a healthy swarm the manager nodes refresh their certificates as they go. You a. let the swarm get into an unhealthy state and b. didn't do anything about it until the certificates had expired. At this point, recovery was not an option as there were no healthy nodes left with valid certificates. In production, this is not a no-go, as, the circumstances that led to this are incompatible with production - that is, having a down system that remains unfixed for months. – Chris Becke Jun 20 '22 at 10:19
  • this issue of not being able to join a cluster is very common even for currently running clusters...docker swarm is VERY buggy..i just started a brand new cluster and tried to promote some workers to managers and one of the nodes was down; tried to re-join and it won't rejoin..this can't be something to be confident with in production – uberrebu Jun 21 '22 at 08:17
  • @ChrisBecke also the PROD comment is just about being able to fix an issue if something breaks. I should be able to put this cluster back together without losing all data or having to create new cluster. There needs to be a route to be able to recover clusters in an easy, non-rocket-science way, else this is playing russian roulette – uberrebu Jun 21 '22 at 13:22

3 Answers


Rotate the swarm CA via docker swarm ca --rotate.

The root CA rotation will not be completed until all registered nodes have rotated their TLS certificates. If the rotation is not completing within a reasonable amount of time, try running docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}} {{.TLSStatus}}' to see if any nodes are down or otherwise unable to rotate TLS certificates.

See https://docs.docker.com/engine/reference/commandline/swarm_ca/
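
As a rough illustration of the check above, the following sketch filters a node list for nodes that are not ready. The `|` separator, the helper name `nodes_needing_attention`, and the assumption that a healthy node reports `Ready` in both status columns are mine and are not taken from the linked documentation.

```shell
# Hypothetical helper: reads lines of the form "ID|Hostname|Status|TLSStatus"
# on stdin and prints every node where either status column is not "Ready".
# Assumption: a healthy node shows "Ready" in both columns.
nodes_needing_attention() {
  awk -F'|' '$3 != "Ready" || $4 != "Ready" {
    print $2 ": status=" $3 ", tls=" $4
  }'
}

# Intended use on a manager node (not executed here):
#   docker node ls --format '{{.ID}}|{{.Hostname}}|{{.Status}}|{{.TLSStatus}}' \
#     | nodes_needing_attention
```

A separator other than a space is used because a TLS status value may itself contain a space.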

Dima Korobskiy

there were just 2 nodes in the cluster and both of them say manager false; commands were run on both nodes and none of them work

Once all managers have left the cluster, I believe it is gone. Before then you could have run the following on one of the managers:

docker swarm init --force-new-cluster

Now that they've all left, you can recreate the cluster from scratch:

# on the manager
docker swarm init

Once you have a new cluster, on the manager run:

docker swarm join-token manager # or worker

Then run the command printed by join-token above on each of the other nodes to join them to the cluster.
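
To make the "which node do I run this on" question concrete, here is a small sketch that maps the local node's swarm state to a suggested next step. The helper name `swarm_next_step` and the wording of its messages are hypothetical; it only assumes that `docker info` can report `.Swarm.LocalNodeState` and `.Swarm.ControlAvailable` for the local node.

```shell
# Hypothetical helper: suggest where to run swarm commands, given two values
# that `docker info` can report for the local node:
#   $1 = .Swarm.LocalNodeState   (for example: active, inactive, error)
#   $2 = .Swarm.ControlAvailable (true on a manager, false otherwise)
swarm_next_step() {
  state=$1
  is_manager=$2
  case "$state" in
    active)
      if [ "$is_manager" = "true" ]; then
        echo "manager: run 'docker swarm join-token worker' here"
      else
        echo "worker: run 'docker swarm join-token' on a manager instead"
      fi
      ;;
    inactive)
      echo "not in a swarm: run 'docker swarm init' or 'docker swarm join'"
      ;;
    *)
      # Any other state (such as error) needs a closer look first.
      echo "state '$state': inspect 'docker info' before leaving the swarm"
      ;;
  esac
}

# Intended use (not executed here):
#   swarm_next_step $(docker info --format '{{.Swarm.LocalNodeState}} {{.Swarm.ControlAvailable}}')
```

The join-token error in the question corresponds to the inactive case: a node that has left the swarm cannot hand out tokens.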

BMitch
  • how does this recover old cluster? i need to recover all the services/stacks running from old cluster...this is like starting brand new cluster; this is not an answer – uberrebu Jul 19 '22 at 13:57
  • @uberrebu Force new cluster would reuse what it can from the existing state, but if you no longer have any managers, there's no existing state to recover, it's gone. – BMitch Jul 19 '22 at 14:08
  • then seems no solution to this then as this can not be recoverable – uberrebu Jul 19 '22 at 14:10

There's a way to recover without losing the deployed swarm services/stacks. The error complains "certificate not valid after Sun, 27 Mar 2022 10:27:00 UTC". So first make the certificate valid again by setting the clock back, then bring the swarm services up, and rotate the CA certificate once the swarm is up and running:

  1. Stop the Docker service:

    service docker stop

  2. Set the system date back to "27 Mar 2022 10:27:00" or slightly earlier:

    date -s "27 Mar 2022 10:27:00"

  3. Bring up the swarm services:

    service docker start

    # check that all the services are up and running

    docker stack ls

  4. Rotate the certificate:

    docker swarm ca --rotate

  5. Set the system date back to the current time (substitute your actual current date and time):

    date -s "19 Apr 2023 06:34:00"
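
The steps above can be sketched as one script. This is only an outline under stated assumptions: the names `run`, `recover_swarm`, and `DRY_RUN` are hypothetical, and both timestamps are supplied by the caller. With `DRY_RUN=1` (the default) the commands are printed rather than executed, so the sequence can be reviewed before anything touches the clock.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default),
# otherwise execute it. Printed commands are shown without quoting.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1 = a time before the certificate expired
# $2 = the real current time
# Both must be in a format accepted by `date -s`.
recover_swarm() {
  rollback_time=$1
  current_time=$2
  # Stop at the first failing step instead of continuing blindly.
  run service docker stop &&
  run date -s "$rollback_time" &&
  run service docker start &&
  run docker stack ls &&
  run docker swarm ca --rotate &&
  run date -s "$current_time"
}

# Example dry run (prints the six commands in order):
#   DRY_RUN=1 recover_swarm "27 Mar 2022 10:00:00" "19 Apr 2023 06:34:00"
```

This sketch does not wait for the rotation to finish before restoring the clock. In a real run it is probably worth checking docker node ls and docker info between the last two steps, and confirming afterwards that the Swarm section no longer reports an error.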

Qiushi