
My application runs on a 3-node Kubernetes cluster and uses Kafka to stream data. I am trying to check my system's ability to recover from node failure, so I deliberately fail one of the nodes for 1 minute.

Around 50% of the time, I experience loss of a single data record after the node failure. If the controller Kafka broker was running on the failed node, I see that a new controller broker is elected as expected. When the data loss occurs, I see the following error in the new controller broker's log:

ERROR [Controller id=2 epoch=13] Controller 2 epoch 13 failed to change state for partition __consumer_offsets-45 from OfflinePartition to OnlinePartition (state.change.logger) [controller-event-thread]

I am not sure if that's the problem, but searching the web for information about this error made me suspect that I need to configure Kafka with more than 1 replica for each topic. This is how my topics/partitions/replicas configuration looks:

[screenshot of the topics/partitions/replicas configuration]

My questions: Is my suspicion that more replicas are required correct?

If yes, how do I increase the number of replicas for my topics? I played around with a few broker parameters, such as default.replication.factor and replication.factor, but I did not see the number of replicas change.

If no, what is the meaning of this error log?

Thanks!

Noam

2 Answers


Yes, if the broker hosting the single replica goes down, then you can expect an unclean topic. If you have unclean leader election disabled, however, you shouldn't lose data that's already been persisted to the broker.

To modify existing topics, you must use the kafka-reassign-partitions tool, not any of the broker settings, as those only apply to brand-new topics. See: Kafka | Increase replication factor of multiple topics
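If you'd rather not hand-write the reassignment JSON, the tool can also propose one for you. A rough sketch (the topic name and broker IDs are placeholders; note that the generated proposal keeps the current replication factor, so you still have to add the extra replica IDs before executing):

    # List the topics whose assignment you want a proposal for (placeholder topic name)
    cat > topics.json <<EOF
    {"version": 1, "topics": [{"topic": "my-topic"}]}
    EOF

    # Ask for a proposed assignment across brokers 0, 1 and 2
    ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
      --topics-to-move-json-file topics.json --broker-list "0,1,2" --generate

    # Edit the proposed JSON to add the extra replica IDs, save it, then:
    ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
      --reassignment-json-file increase.json --execute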

Ideally, you should disable auto topic creation, as well, to force clients to use Topic CRD resources in Strimzi that include a replication factor, and you can use other k8s tools to verify that they have values greater than 1.
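For illustration, a minimal sketch of such a KafkaTopic resource (the topic name and strimzi.io/cluster label are placeholders, and this assumes the Strimzi topic operator is running):

    # Hypothetical KafkaTopic with 3 replicas, applied with kubectl
    cat <<EOF | kubectl apply -f -
    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaTopic
    metadata:
      name: s1-example                   # placeholder topic name
      labels:
        strimzi.io/cluster: my-cluster   # must match your Kafka cluster's name
    spec:
      partitions: 3
      replicas: 3
      config:
        min.insync.replicas: 2
    EOF

Auto topic creation itself is a broker-side setting (auto.create.topics.enable=false), separate from the topic resources.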

OneCricketeer
  • Hi OneCricketeer, thank you for the quick response. I did not configure unclean.leader.election.enable, so it should be disabled (the default). The lost data record is indeed not committed to the last topic in my data pipeline, so I guess it's not a case of persisted data being lost. Also, the lost data record is always one that was inserted into the pipeline a short time after the node failure, so it makes sense in that regard. – Noam Apr 12 '22 at 14:25
  • For my case it's okay to configure via the broker settings because I am creating new topics on every run. Nevertheless, thanks for enlightening me about the kafka-reassign-partitions tool, as it may be useful for me in the future. – Noam Apr 12 '22 at 14:25
  • When you create topics on your own, you also are required to set replication factor. That is separate from the broker config – OneCricketeer Apr 12 '22 at 14:26
  • You will need to modify the topics that start with underscores, `connect-` and control center ones on your own, too – OneCricketeer Apr 12 '22 at 14:27
  • For a start, I would like to set the number of replicas for topics that start with `s1`, `dst`, and `c-` (for internal reasons of my specific application). – Noam Apr 12 '22 at 14:30
  • Below is a snippet of my current broker configuration. Perhaps you can explain what should be changed in order to achieve 3 replicas per topic? `config: replication.factor: 3 default.replication.factor: 3 log.message.format.version: "2.8" inter.broker.protocol.version: "2.8" auto.create.topics.enable: false` – Noam Apr 12 '22 at 14:33
  • That is fine, but like I said, it will not update existing topics – OneCricketeer Apr 12 '22 at 15:35
  • After configuring all topics to have 3 replicas I still experience data loss. Any idea what else could be the reason for that? Maybe some other configuration? How about `min.insync.replicas`? – Noam Apr 14 '22 at 11:15
  • Is data lost from the producer before/after an ack? Or lost from the broker after a pod restarts? Min ISR should be higher than one, and less than or equal to the replication factor, yes. You've also not clarified how your persistent volumes are configured (if at all) – OneCricketeer Apr 14 '22 at 15:21
  • Looks like the source connector fails to read the lost data from the S3 bucket (I am working with an S3 sink connector). It happens right after the node failure. Is there a way to configure the sink connector to retry reading the data in case of failure? – Noam Apr 18 '22 at 09:28
  • I'm not sure what you mean. The S3 sink connector doesn't perform S3 reads. Data is only tracked from the topic itself – OneCricketeer Apr 18 '22 at 14:55
  • When I read the data that was processed by the topic I see that an event is missing. The missing event status is "SCHEDULED", then "FAILED" and finally "CLEANED". – Noam Apr 19 '22 at 08:27
  • Sorry, I don't know what these values mean. Those aren't Kafka event statuses – OneCricketeer Apr 19 '22 at 14:35
  • Maybe these logs are more helpful (file-00036.in is the lost data): WARN Failed to load committed offset for object file s3://metro-bucket-221hsbomum/in/file-00036.in. Previous offset will be ignored. Error: Failed to fetch offsets. (io.streamthoughts.kafka.connect.filepulse.source.DefaultFileRecordsPollingConsumer) [task-thread-s1-ilcera-0] INFO Opening new iterator for: s3://metro-bucket-221hsbomum/in/file-00036.in (io.streamthoughts.kafka.connect.filepulse.source.DelegateFileInputIterator) [task-thread-s1-ilcera-0] – Noam Apr 20 '22 at 06:54
  • And then: ERROR Failed to get object metadata from Amazon S3. Error occurred while making the request or handling the response for s3://metro-bucket-221hsbomum/in/file-00036.in: {} (io.streamthoughts.kafka.connect.filepulse.fs.AmazonS3Storage) [task-thread-s1-ilcera-0] ERROR Failed to open and initialize new iterator for object-file: s3://metro-bucket-221hsbomum/in/file-00036.in. (io.streamthoughts.kafka.connect.filepulse.source.DefaultFileRecordsPollingConsumer) [task-thread-s1-ilcera-0] – Noam Apr 20 '22 at 06:55
  • And finally: ERROR Error while processing source file '[uri=s3://metro-bucket-221hsbomum/in/file-00036.in, name='null', contentLength=null, lastModified=null, contentDigest=null, userDefinedMetadata={}]' (io.streamthoughts.kafka.connect.filepulse.source.FileObjectStateReporter) [task-thread-s1-ilcera-0] io.streamthoughts.kafka.connect.filepulse.reader.ReaderException: Failed to create BytesArrayInputIterator for: s3://metro-bucket-221hsbomum/in/file-00036.in – Noam Apr 20 '22 at 06:56
  • Okay, I don't have experience with this file pulse connector. I assumed you're using the Confluent S3 sink, which doesn't read offsets from S3... But maybe you have an IAM policy that prevents GetObject calls? In any case, this doesn't seem related to your original questions, so I suggest making a new post or github issue in the file pulse connector repo – OneCricketeer Apr 20 '22 at 13:51
  • Thanks, @OneCricketeer. I will open a new post. – Noam Apr 24 '22 at 13:09
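Following up on the min.insync.replicas and acknowledgement discussion in the comments above, a minimal sketch of the relevant settings (the topic name and values are only illustrative):

    # Require at least 2 in-sync replicas for a topic (must be <= its replication factor)
    ./bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
      --entity-type topics --entity-name s1-example \
      --add-config min.insync.replicas=2

    # On the producer side, only count a write as successful once all
    # in-sync replicas have it, e.g. in the producer config:
    #   acks=all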

Yes, you're right, you need to set the replication factor to more than 1 to be able to sustain broker-level failures. Once you add this value as a default, new topics will start out with the configured number of replicas. For existing topics, however, you need to follow the instructions below:

  1. Describe the topic

    $ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic one
     Topic: one  PartitionCount: 3   ReplicationFactor: 1    Configs: segment.bytes=1073741824
     Topic: one  Partition: 0    Leader: 1   Replicas: 1 Isr: 1
     Topic: one  Partition: 1    Leader: 0   Replicas: 0 Isr: 0
     Topic: one  Partition: 2    Leader: 2   Replicas: 2 Isr: 2
    
  2. Create the json file with the topic reassignment details

    $ cat >>increase.json <<EOF
    {
     "version":1,
     "partitions":[
        {"topic":"one","partition":0,"replicas":[0,1,2]},
        {"topic":"one","partition":1,"replicas":[1,0,2]},
        {"topic":"one","partition":2,"replicas":[2,1,0]},
     ]
    }
    EOF
    
  3. Execute this reassignment plan

    $ ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase.json --execute
    Current partition replica assignment
    
    {"version":1,"partitions":[{"topic":"one","partition":0,"replicas":[0,1,2],"log_dirs":["any","any"]},{"topic":"one","partition":1,"replicas":[1,0,2],"log_dirs":["any","any"]},{"topic":"one","partition":2,"replicas":[2,1.0],"log_dirs":["any","any"]}]}
    

Save this to use as the --reassignment-json-file option during rollback Successfully started partition reassignments for one-0,one-1,one-2

  4. Describe the topic again

    $ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic one
     Topic: one  PartitionCount: 3   ReplicationFactor: 3    Configs: segment.bytes=1073741824
      Topic: one  Partition: 0    Leader: 0   Replicas: 0,1,2 Isr: 0,1,2
      Topic: one  Partition: 1    Leader: 1   Replicas: 1,0,2 Isr: 1,0,2
      Topic: one  Partition: 2    Leader: 2   Replicas: 2,1,0 Isr: 2,1,0
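You can also ask the tool to confirm that the reassignment has finished (a sketch, reusing the increase.json from step 2):

    $ ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase.json --verify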
    
AP.