My application runs on a 3-node Kubernetes cluster and uses Kafka to stream data. I am trying to check my system's ability to recover from node failure, so I deliberately fail one of the nodes for 1 minute.
About 50% of the time, I lose a single data record after the node failure. If the Kafka controller broker was running on the failed node, I see that a new controller broker is elected, as expected. When the data loss occurs, I see the following error in the new controller broker's log:
ERROR [Controller id=2 epoch=13] Controller 2 epoch 13 failed to change state for partition __consumer_offsets-45 from OfflinePartition to OnlinePartition (state.change.logger) [controller-event-thread]
I am not sure if that's the problem, but searching the web for information about this error made me suspect that I need to configure Kafka to have more than 1 replica for each topic.
This is what my topics/partitions/replicas configuration looks like:
My questions: Is my suspicion that more replicas are required correct?
If yes, how do I increase the number of replicas per topic? I played around with a few broker parameters, such as default.replication.factor and replication.factor, but I did not see the number of replicas change.
If no, what is the meaning of this error log?
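Regarding the replication factor question: as far as I understand, default.replication.factor only applies to topics that are auto-created afterwards, and offsets.topic.replication.factor only applies when __consumer_offsets is first created, so existing topics would need an explicit partition reassignment. Below is a minimal sketch of how I would generate the plan file read by kafka-reassign-partitions.sh (its --reassignment-json-file option). The broker ids 0, 1, 2 and the helper name build_reassignment are my assumptions, not something taken from my cluster; 50 is Kafka's default partition count for __consumer_offsets.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Build a plan in the JSON layout read by kafka-reassign-partitions.sh.

    Hypothetical helper for illustration; it is not part of Kafka.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    partitions = []
    for p in range(num_partitions):
        # Rotate the first replica so preferred leaders are spread over brokers.
        replicas = [
            broker_ids[(p + i) % len(broker_ids)]
            for i in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Assumed values: brokers 0, 1, 2 and 50 partitions for __consumer_offsets.
    plan = build_reassignment("__consumer_offsets", 50, [0, 1, 2], 3)
    print(json.dumps(plan, indent=2))
```

The output would be saved to a file and passed to kafka-reassign-partitions.sh with --execute. I have not confirmed that this resolves the error above.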
Thanks!