[org.jgroups.protocols.pbcast.NAKACK] (requester=, local_addr=) message ::port not found in retransmission table of :port: (size=xxxx, missing=x, highest stability=xxxxx)]

suresh

1 Answer

NAKACK (or its newer cousin, NAKACK2) provides reliable transmission of messages to the cluster. To do this, every message gets a sequence number (seqno), and receivers deliver messages to the application in seqno order.
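
To make the seqno ordering concrete, here is a minimal sketch of a receiver-side window that buffers out-of-order messages and releases them in seqno order. It is only an illustration: the class `OrderedWindow` and its methods are hypothetical names, not the actual NAKACK code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration of in-order delivery; not the JGroups implementation.
class OrderedWindow {
    private final TreeMap<Long, String> buffered = new TreeMap<>();
    private long nextToDeliver;

    OrderedWindow(long firstSeqno) {
        this.nextToDeliver = firstSeqno;
    }

    // Buffers the message and returns everything that can now be delivered in seqno order.
    List<String> receive(long seqno, String payload) {
        List<String> deliverable = new ArrayList<>();
        if (seqno < nextToDeliver) {
            return deliverable; // already delivered earlier: treat as a duplicate
        }
        buffered.putIfAbsent(seqno, payload);
        // Drain the buffer while there is no gap at the head.
        while (buffered.containsKey(nextToDeliver)) {
            deliverable.add(buffered.remove(nextToDeliver));
            nextToDeliver++;
        }
        return deliverable;
    }
}
```

If 22 arrives before 21, nothing is delivered until 21 shows up, and then both are handed to the application together.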

Every cluster member has a table of all other members and their messages (conceptually one list per member). When member P sends messages P21, P22 and P23, a receiver R first looks up the message list for P, then adds P21-P23 to that list.

However, in your case, the list for the sender was not found. This means that the sender was not a cluster member (anymore).

For example, if we have cluster {P,Q,R,T} and member P leaves or is excluded because it was suspected (e.g. no heartbeat was received from it for a period of time), then messages P21-P23 will be dropped by every receiver.

This is because JGroups only allows cluster members to send and receive messages.
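
A rough sketch of that lookup, assuming a plain map from member name to message list (the class `SenderTable` and its methods are made up for illustration and do not mirror JGroups internals):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical per-sender table; one message list per current cluster member.
class SenderTable {
    private final Map<String, List<String>> messagesBySender = new HashMap<>();

    // Installs a new membership: lists of departed members are removed,
    // new members get an empty list.
    void installView(Set<String> members) {
        messagesBySender.keySet().retainAll(members);
        for (String member : members) {
            messagesBySender.putIfAbsent(member, new ArrayList<>());
        }
    }

    // Returns true if the message was accepted, false if it was dropped
    // because there is no list for the sender (sender is not a member).
    boolean receive(String sender, String message) {
        List<String> list = messagesBySender.get(sender);
        if (list == null) {
            return false;
        }
        list.add(message);
        return true;
    }
}
```

Once a member is no longer in the installed membership, its list is gone and anything still arriving from it is discarded.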

How can a member get excluded?

This is likely done by one of the failure detection protocols (e.g. FD_ALL or FD).

Another possibility is that your thread pools were clogged and failure detection heartbeat messages were dropped, leading to false suspicions.

Also, long GC pauses can cause this.
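
To see why dropped heartbeats or a long pause lead to a false suspicion, here is a simplified heartbeat detector with simulated timestamps. This is a sketch under my own assumptions (the class `HeartbeatDetector` is hypothetical), not how FD_ALL or FD is actually written:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical timeout-based failure detector driven by explicit timestamps.
class HeartbeatDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeard = new HashMap<>();

    HeartbeatDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records that a heartbeat from the member was processed at the given time.
    void heartbeat(String member, long nowMillis) {
        lastHeard.put(member, nowMillis);
    }

    // Returns all members whose last heartbeat is older than the timeout.
    List<String> suspects(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With a 10 s timeout, a member whose heartbeats are not processed for 15 s (because it was paused, or because the heartbeats were dropped on the way) gets suspected even though it is alive. With a 20 s timeout it would not be.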

Fixes:

  • Increase the timeouts in FD_ALL or FD. The timeout should be longer than the longest GC cycle. Note that it will now take longer to detect hung members.
  • Size your thread pools appropriately, e.g. make sure that the maximum number of threads is large and the queue is disabled.
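
As an analogy for the second point, this is what "large max, no queue" looks like with a plain `java.util.concurrent.ThreadPoolExecutor`. JGroups configures its own pools through the transport configuration, so treat this as an illustration of the idea only; `NoQueuePool` is a made-up helper.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: a pool that never queues work.
class NoQueuePool {
    static ThreadPoolExecutor create(int minThreads, int maxThreads) {
        return new ThreadPoolExecutor(
                minThreads, maxThreads,
                30, TimeUnit.SECONDS,       // idle threads above min are reclaimed after 30 s
                new SynchronousQueue<>());  // direct hand-off: tasks are never parked in a queue
    }
}
```

With a hand-off queue, a task that finds all threads busy gets a new thread (up to the max) instead of waiting behind other work. Once the max is reached, further tasks are rejected, which is why the max has to be generous.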

Note that false suspicions can still happen, but MERGE3 should remedy a split cluster later on.

Bela Ban