I have a Consul cluster in production that is integrated with Terraform, Ansible, Nomad, Docker, and Vault. Consul keeps looking for a dead node that was part of the initial setup, even after I removed its entry from the raft peers.json. Below is my Consul log:
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: 'ip-10-10-2-49'
Datacenter: 'us-east-1'
Server: true (bootstrap: false)
Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
Cluster Addr: 10.10.2.49 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
Atlas: <disabled>
==> Log data will now stream in as it occurs:
2017/01/24 13:52:59 [INFO] raft: Restored from snapshot 1596-4460452-1485234581011
2017/01/24 13:52:59 [INFO] raft: Node at 10.10.2.49:8300 [Follower] entering Follower state
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-2-49 10.10.2.49
2017/01/24 13:52:59 [INFO] serf: Attempting re-join to previously known node: ip-10-0-1-206: 10.0.1.206:8301
2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-2-49 (Addr: 10.10.2.49:8300) (DC: us-east-1)
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-4-149 10.10.4.149
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-1-10 10.10.1.10
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-0-3-119 10.0.3.119
2017/01/24 13:52:59 [WARN] memberlist: Refuting an alive message
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-3-84 10.10.3.84
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-0-1-206 10.0.1.206
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-1-252 10.10.1.252
2017/01/24 13:52:59 [INFO] serf: Re-joined to previously known node: ip-10-0-1-206: 10.0.1.206:8301
2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-1-10 (Addr: 10.10.1.10:8300) (DC: us-east-1)
2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-3-84 (Addr: 10.10.3.84:8300) (DC: us-east-1)
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-2-49.us-east-1 10.10.2.49
2017/01/24 13:52:59 [INFO] consul: adding WAN server ip-10-10-2-49.us-east-1 (Addr: 10.10.2.49:8300) (DC: us-east-1)
2017/01/24 13:52:59 [WARN] serf: Failed to re-join any previously known node
2017/01/24 13:52:59 [INFO] agent: Joining cluster...
2017/01/24 13:52:59 [ERR] agent: failed to sync remote state: No cluster leader
2017/01/24 13:52:59 [INFO] agent: (LAN) joining: [consul-1.example-private.com consul-2.example-private.com consul-3.example-private.com]
2017/01/24 13:52:59 [INFO] agent: (LAN) joined: 3 Err: <nil>
2017/01/24 13:52:59 [INFO] agent: Join completed. Synced with 3 initial agents
2017/01/24 13:53:01 [WARN] raft: Heartbeat timeout reached, starting election
2017/01/24 13:53:01 [INFO] raft: Node at 10.10.2.49:8300 [Candidate] entering Candidate state
2017/01/24 13:53:01 [INFO] raft: Election won. Tally: 3
2017/01/24 13:53:01 [INFO] raft: Node at 10.10.2.49:8300 [Leader] entering Leader state
2017/01/24 13:53:01 [INFO] consul: cluster leadership acquired
2017/01/24 13:53:01 [INFO] consul: New leader elected: ip-10-10-2-49
2017/01/24 13:53:01 [INFO] raft: pipelining replication to peer 10.10.3.84:8300
2017/01/24 13:53:01 [INFO] raft: pipelining replication to peer 10.10.1.10:8300
2017/01/24 13:53:01 [WARN] raft: Failed to contact 10.10.1.23:8300 in 501.633573ms
2017/01/24 13:53:02 [INFO] agent: Synced node info
2017/01/24 13:53:02 [WARN] raft: Failed to contact 10.10.1.23:8300 in 961.388392ms
2017/01/24 13:53:02 [WARN] raft: Failed to contact 10.10.1.23:8300 in 1.42262185s
2017/01/24 13:53:11 [ERR] raft: Failed to make RequestVote RPC to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
2017/01/24 13:53:11 [ERR] raft: Failed to AppendEntries to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
2017/01/24 13:53:11 [ERR] raft: Failed to heartbeat to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
I removed the entry for the dead node 10.10.1.23 from peers.json and restarted everything, but Consul keeps looking for the same dead node. Can someone guide me on how to kick this node out? I have tried all the basic commands outlined in the Consul documentation to remove this particular node, but after a restart of the service it appears in the logs again.
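For reference, here is a sketch of what I attempted, assuming Consul 0.7.x. The data directory path and the three live-server addresses are taken from my setup; the demo path below is only for illustration (in production the file lives under the agent's `-data-dir`, e.g. `/var/consul/raft/peers.json`):

```shell
# Demo path for illustration; in production use <data-dir>/raft/peers.json.
RAFT_DIR="${RAFT_DIR:-/tmp/consul-demo/raft}"
mkdir -p "$RAFT_DIR"

# 1. First, ask the cluster to forget the dead node by its node name
#    (run against a live server; shown commented out here):
# consul force-leave ip-10-10-1-23

# 2. If the dead raft peer persists: stop ALL Consul servers, then write
#    an identical peers.json on each server listing ONLY the live servers,
#    and start the servers again. JSON array of "ip:port" strings:
cat > "$RAFT_DIR/peers.json" <<'EOF'
["10.10.2.49:8300","10.10.1.10:8300","10.10.3.84:8300"]
EOF

# Sanity-check that the file is valid JSON before restarting Consul:
python3 -c 'import json,sys; print(len(json.load(open(sys.argv[1]))))' "$RAFT_DIR/peers.json"
```

My understanding is that editing peers.json only takes effect if every server is stopped before the file is written, so that no running server re-replicates the old peer set over it; that may be where my attempt went wrong.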