In this post, there is an approved comment with the following statement:
Cluster takes this to the next level by using a quorum agreement to prevent message loss in the case of node failure.
I'm testing delivery in the case of a single cluster node failure, but from my observations, messages can get lost when a node fails.
I'm using io.aeron.samples.cluster.tutorial.BasicAuctionClusterClient from the Aeron code base together with io.aeron.samples.cluster.tutorial.BasicAuctionClusteredService (version 1.38.1). I made a small adjustment to onSessionMessage in the clustered service to see whether each message was received:
public void onSessionMessage(
    final ClientSession session,
    final long timestamp,
    final DirectBuffer buffer,
    final int offset,
    final int length,
    final Header header)
{
    final long correlationId = buffer.getLong(offset + CORRELATION_ID_OFFSET);
    System.out.println("Received message with correlation ID " + correlationId); // this line is added
    // the rest is unchanged
}
When I start the cluster with 3 nodes, one of them is elected LEADER. Then I start the BasicAuctionClusterClient, which begins sending requests to the cluster. When I stop the leader, a new one is elected as expected, but the messages sent between the leader's failure and the new leader's election never reach the cluster (see the gap in the correlation IDs below):
New role is LEADER
Received message with correlation ID -8046281870845246166
attemptBid(this=Auction{bestPrice=144, currentWinningCustomerId=1}, price=152,customerId=1)
Received message with correlation ID -8046281870845246165
attemptBid(this=Auction{bestPrice=152, currentWinningCustomerId=1}, price=158,customerId=1)
Consensus Module
io.aeron.cluster.client.ClusterEvent: WARN - leader heartbeat timeout
Received message with correlation ID -8046281870845246154
attemptBid(this=Auction{bestPrice=158, currentWinningCustomerId=1}, price=167,customerId=1)
What is the developer expected to do if they want guaranteed delivery (processing)? Is the expectation a custom-made acknowledgement system with retries on the client side and duplicate-request handling on the cluster node's side?
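For context, here is a minimal sketch of the duplicate-request handling I have in mind. This is not part of the Aeron samples; the class name and method are hypothetical. The idea is that the service records the highest correlation ID it has applied per client session, so a client can safely resend any message it never saw acknowledged after a failover, and the service will ignore resends it has already processed:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical dedup helper (not an Aeron API): tracks, per client session,
// the highest correlation ID already applied, assuming the client assigns
// monotonically increasing correlation IDs per session.
public class SessionDedupState
{
    private final Map<Long, Long> lastAppliedPerSession = new HashMap<>();

    /**
     * Returns true if the message is new and should be applied to the
     * service state; returns false for a duplicate resend.
     */
    public boolean shouldApply(final long sessionId, final long correlationId)
    {
        final Long last = lastAppliedPerSession.get(sessionId);
        if (last != null && correlationId <= last)
        {
            return false; // already applied before the failover; ignore resend
        }
        lastAppliedPerSession.put(sessionId, correlationId);
        return true;
    }
}
```

Note that for this to survive restarts, the dedup map would itself have to be part of the replicated service state (i.e. included in snapshots), since it must be consistent across all cluster nodes.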