
The following is my architecture:

2 servers:
Server 1: running Apache NiFi + ZooKeeper (not embedded)
Server 2: running Apache NiFi + ZooKeeper (not embedded)

To test failover, I shut down the server that has been elected Cluster Coordinator.

In this case, ZooKeeper should automatically elect the remaining server as leader. Instead, the election keeps failing, and the surviving server goes into a continuous loop of trying to connect to the first server.

ZooKeeper logs on Server 2 after the leader (Server 1) went down:

2019-10-22 18:44:01,135 [myid:2] - WARN  [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:02,925 [myid:2] - WARN  [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:03,320 [myid:2] - WARN  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@677] - Cannot open channel to 1 at election address ec2-server-1.compute-1.amazonaws.com/172.xx.x.x:3888
    java.net.ConnectException: Connection refused (Connection refused)
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

Server 2 Config files:

zoo.cfg

tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=ec2-server-1.compute-1.amazonaws.com:2888:3888
server.2=0.0.0.0:2888:3888

nifi.properties

nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-2.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1

# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi

Server 1 Config files:

zoo.cfg

tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=0.0.0.0:2888:3888
server.2=ec2-server-2.compute-1.amazonaws.com:2888:3888

nifi.properties

nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-1.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1

# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi

What am I doing wrong?


1 Answer


You need at least three ZooKeeper nodes to be able to handle the failure of one node.

Check the ZooKeeper Admin Guide:

Clustered (Multi-Server) Setup

For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. As long as a majority of the ensemble are up, the service will be available. Because Zookeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.

There is a simpler explanation here as well.
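As a sketch of what the three-server setup could look like (the third hostname, ec2-server-3.compute-1.amazonaws.com, is a placeholder for the extra instance you would add), the zoo.cfg on every server would list all three ensemble members, and each server's dataDir would need a myid file containing that server's id:

tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
# a majority of 3 is 2, so the ensemble survives one failure
server.1=ec2-server-1.compute-1.amazonaws.com:2888:3888
server.2=ec2-server-2.compute-1.amazonaws.com:2888:3888
server.3=ec2-server-3.compute-1.amazonaws.com:2888:3888

Note that quorum is computed against the configured ensemble size, not the number of live servers, so with three servers configured the two survivors still form a majority and can elect a leader between themselves. With an external ensemble it is also common to point NiFi at all ZooKeeper nodes instead of only localhost, so a NiFi node can still reach the ensemble if its local ZooKeeper dies:

nifi.zookeeper.connect.string=ec2-server-1.compute-1.amazonaws.com:2181,ec2-server-2.compute-1.amazonaws.com:2181,ec2-server-3.compute-1.amazonaws.com:2181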

  • Just one confusion: when 1 out of 3 nodes goes down, won't the remaining 2 servers conflict again over who becomes the leader, since there's no majority in that case either? – Jaskaran Singh Puri Oct 24 '19 at 06:22
  • I can't say exactly how ZooKeeper's election works, but most consensus protocols use some sort of randomized voting, so the nodes would not always choose themselves as leader; a conflict may happen, but the randomness makes it likely that a re-election resolves it. –  Oct 24 '19 at 07:15