
I have a Consul cluster in production that is integrated with Terraform, Ansible, Nomad, Docker, and Vault. Consul keeps looking for a dead node that was part of the initial setup, even after I removed its entry from the Raft peers.json. Below is my Consul log.

==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'ip-10-10-2-49'
        Datacenter: 'us-east-1'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.10.2.49 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
             Atlas: <disabled>

==> Log data will now stream in as it occurs:

    2017/01/24 13:52:59 [INFO] raft: Restored from snapshot 1596-4460452-1485234581011
    2017/01/24 13:52:59 [INFO] raft: Node at 10.10.2.49:8300 [Follower] entering Follower state
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-2-49 10.10.2.49
    2017/01/24 13:52:59 [INFO] serf: Attempting re-join to previously known node: ip-10-0-1-206: 10.0.1.206:8301
    2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-2-49 (Addr: 10.10.2.49:8300) (DC: us-east-1)
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-4-149 10.10.4.149
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-1-10 10.10.1.10
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-0-3-119 10.0.3.119
    2017/01/24 13:52:59 [WARN] memberlist: Refuting an alive message
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-3-84 10.10.3.84
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-0-1-206 10.0.1.206
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-1-252 10.10.1.252
    2017/01/24 13:52:59 [INFO] serf: Re-joined to previously known node: ip-10-0-1-206: 10.0.1.206:8301
    2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-1-10 (Addr: 10.10.1.10:8300) (DC: us-east-1)
    2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-3-84 (Addr: 10.10.3.84:8300) (DC: us-east-1)
    2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-2-49.us-east-1 10.10.2.49
    2017/01/24 13:52:59 [INFO] consul: adding WAN server ip-10-10-2-49.us-east-1 (Addr: 10.10.2.49:8300) (DC: us-east-1)
    2017/01/24 13:52:59 [WARN] serf: Failed to re-join any previously known node
    2017/01/24 13:52:59 [INFO] agent: Joining cluster...
    2017/01/24 13:52:59 [ERR] agent: failed to sync remote state: No cluster leader
    2017/01/24 13:52:59 [INFO] agent: (LAN) joining: [consul-1.example-private.com consul-2.example-private.com consul-3.example-private.com]
    2017/01/24 13:52:59 [INFO] agent: (LAN) joined: 3 Err: <nil>
    2017/01/24 13:52:59 [INFO] agent: Join completed. Synced with 3 initial agents
    2017/01/24 13:53:01 [WARN] raft: Heartbeat timeout reached, starting election
    2017/01/24 13:53:01 [INFO] raft: Node at 10.10.2.49:8300 [Candidate] entering Candidate state
    2017/01/24 13:53:01 [INFO] raft: Election won. Tally: 3
    2017/01/24 13:53:01 [INFO] raft: Node at 10.10.2.49:8300 [Leader] entering Leader state
    2017/01/24 13:53:01 [INFO] consul: cluster leadership acquired
    2017/01/24 13:53:01 [INFO] consul: New leader elected: ip-10-10-2-49
    2017/01/24 13:53:01 [INFO] raft: pipelining replication to peer 10.10.3.84:8300
    2017/01/24 13:53:01 [INFO] raft: pipelining replication to peer 10.10.1.10:8300
    2017/01/24 13:53:01 [WARN] raft: Failed to contact 10.10.1.23:8300 in 501.633573ms
    2017/01/24 13:53:02 [INFO] agent: Synced node info
    2017/01/24 13:53:02 [WARN] raft: Failed to contact 10.10.1.23:8300 in 961.388392ms
    2017/01/24 13:53:02 [WARN] raft: Failed to contact 10.10.1.23:8300 in 1.42262185s
    2017/01/24 13:53:11 [ERR] raft: Failed to make RequestVote RPC to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
    2017/01/24 13:53:11 [ERR] raft: Failed to AppendEntries to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
    2017/01/24 13:53:11 [ERR] raft: Failed to heartbeat to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout

I tried removing the entry for the dead node 10.10.1.23 from peers.json and restarting, but Consul keeps looking for the same dead node. Can someone guide me on how to kick this node out? I have tried all the basic commands outlined in the Consul documentation to remove this particular node, but after a restart of the service it appears in the logs again.
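
For reference, after removing 10.10.1.23 the raft/peers.json I edited contains only the surviving servers from the log above. The exact contents here are illustrative, not a copy of my file, and assume the plain address-list format that Consul's Raft store used at this version:

    ["10.10.2.49:8300", "10.10.1.10:8300", "10.10.3.84:8300"]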

Shailesh Sutar

1 Answer


To remove a dead node that's still in the cluster state, you can call consul force-leave <node name> from a live node. This will put that node in the "left" state in the cluster, which is what should have happened if the node had left the cluster gracefully.
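
For example, assuming the dead node followed the same naming pattern as the other nodes in your logs (the name ip-10-10-1-23 below is a guess based on that pattern; check the real name first), you could run something like:

    # list cluster members to find the name registered for 10.10.1.23
    consul members

    # force the dead node out of the cluster state
    consul force-leave ip-10-10-1-23

After that, consul members should show the node as "left" rather than "failed", and the Raft layer should stop trying to contact it once the change propagates.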

Adrian