19

I'm using Kafka, and we have a use case to build a fault-tolerant system where not even a single message can be missed. Here's the problem: if publishing to Kafka fails for any reason (ZooKeeper down, Kafka broker down, etc.), how can we robustly hold on to those messages and replay them once things are back up again? Again, we cannot afford to lose even a single message. A second requirement is that at any given point in time we need to know how many messages failed to publish to Kafka (i.e. something like counter functionality), because those messages then need to be re-published.

One possible solution is to push those messages to a database that can handle that kind of write load and also provide us with very accurate counter functionality. Cassandra comes to mind, since its writes are very fast, but its counter functionality is not that great, so we don't want to rely on it.

This question is more from an architecture perspective, and then about which technology to use to make that happen.

PS: We handle somewhere around 3000 TPS, so when the system starts failing, the backlog of failed messages can grow very fast in a very short time. We're using Java-based frameworks.

Thanks for your help!

User5817351
  • 989
  • 2
  • 16
  • 36
  • Hi @Nishant, did you find a "solution"? Care to share with the community? Thanks in advance. – jumping_monkey Oct 21 '19 at 08:31
  • Maybe you need an append only database, like timescaledb or influxdb. For those the 3k events per second is not a big deal. – inf3rno Dec 18 '19 at 19:08
  • I don't know much about the topic, but it seems easier to do this with a pulling approach instead of pushing. So you can add a webservice to the sender side and you can poll the webservice from the receiver side. So the receiver will be responsible for getting the message and not the sender or another component in the middle will be responsible for delivering it to all receivers, maintaining the receiver list, retrying, etc... But I guess this is not always an option, because it is not fast enough or maybe for other reasons I don't know of. – inf3rno Dec 18 '19 at 19:43

3 Answers

6

The reason Kafka was built in a distributed, fault-tolerant way is to handle problems exactly like yours: multiple failures of core components should not cause service interruptions. To avoid a down ZooKeeper, deploy at least 3 ZooKeeper instances (if this is in AWS, deploy them across availability zones). To avoid broker failures, deploy multiple brokers, and make sure you specify multiple brokers in your producer's bootstrap.servers property. To ensure that the Kafka cluster has written your message in a durable manner, set acks=all in the producer. This acknowledges a client write only once all in-sync replicas have received the message (at the expense of throughput). You can also set queuing limits so that if writes to the broker start backing up, you can catch an exception, handle it, and possibly retry.
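As a sketch, a producer configured along those lines might look like this (the broker addresses and the numeric limits are placeholders for illustration, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

public class DurableProducerConfig {

  public static Producer<String, String> create() {
    Properties props = new Properties();
    // List several brokers so the client survives a single broker failure
    props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
    // Wait for all in-sync replicas to acknowledge each write
    props.put("acks", "all");
    // Bound how long send() may block when the local buffer fills up,
    // so the application can catch the resulting exception and handle it
    props.put("max.block.ms", 10000);
    props.put("buffer.memory", 33554432);
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    return new KafkaProducer<>(props);
  }
}
```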

Using Cassandra (another well-thought-out distributed, fault-tolerant system) to "stage" your writes doesn't seem to add any reliability to your architecture, but it does increase the complexity. Besides, Cassandra wasn't written to be a message queue for a message queue, so I would avoid this.

Properly configured, Kafka should be available to handle all your message writes and provide suitable guarantees.

Chris Matta
  • 3,263
  • 3
  • 35
  • 48
  • 2
    Thanks Chris! I understand Kafka was designed to handle such situations, but using that as an argument that things will always work as they're supposed to is a bold claim, and to me it's doomed to fail sooner or later. Just to give you an example of how things can still go out of control even with enough brokers and enough ZooKeeper instances: if a topic has 3 replicas and min.insync.replicas is set to 2, writes to the broker succeed only when 2 out of 3 replicas are in sync. In that case, if the replicas fall out of sync, the broker won't accept new requests. – User5817351 Oct 24 '16 at 06:10
  • 1
    @Coder this might be a helpful blog about making sure your cluster is properly configured to help keep your lagging replicas as members of the ISR: http://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/ – Chris Matta Oct 24 '16 at 14:12
  • Thanks @Chris this is useful! – User5817351 Oct 26 '16 at 19:00
  • What about network failures, where Kafka becomes unreachable? – Lovin Aug 10 '21 at 15:33
  • @Lovin this is mitigated by deploying across at least three availability zones – Chris Matta Aug 11 '21 at 19:05
  • @ChrisMatta what if you are trying connect from an application hosted on prem to aws MSK and there is a network outage? – Payam R Aug 23 '23 at 10:12
5

I am super late to the party, but I see something missing in the above answers :)

The strategy of choosing a distributed system like Cassandra as a staging area is a decent idea: once Kafka is back up and healthy, you can retry all the messages that were written into it.
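As a sketch of that strategy (the `FailedMessageStore` class here is hypothetical and in-memory purely for illustration; a real implementation would persist to Cassandra or another durable store), the producer stages failed messages and a replay job re-publishes them once Kafka recovers:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BiConsumer;

// Hypothetical staging area for messages that failed to publish.
// A production version would write to Cassandra (or similar) instead
// of holding messages in memory.
public class FailedMessageStore {

  private final Queue<String> staged = new ArrayDeque<>();

  // Called from the producer's failure path
  public synchronized void stage(String message) {
    staged.add(message);
  }

  // The "counter functionality" the question asks about
  public synchronized int stagedCount() {
    return staged.size();
  }

  // Once Kafka is healthy again, drain the store and re-publish
  // each message via the supplied (topic, message) publisher.
  public synchronized void replay(BiConsumer<String, String> publisher,
                                  String topic) {
    String msg;
    while ((msg = staged.poll()) != null) {
      publisher.accept(topic, msg);
    }
  }
}
```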

I would like to answer the part about "knowing how many messages failed to publish at a given time".

From the tags, I see that you are using apache-kafka and kafka-consumer-api. You can write a custom callback for your producer, and this callback can tell you whether the message failed or was published successfully. On failure, log the metadata for the message.

Now, you can use log analyzing tools to analyze your failures. One such decent tool is Splunk.

Below is a small code snippet that can explain better the callback I was talking about:

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ProduceToKafka {

  private static final Logger log = LoggerFactory.getLogger(ProduceToKafka.class);

  // TracerBulletProducer class has producer properties
  private final KafkaProducer<String, String> myProducer = TracerBulletProducer
      .createProducer();

  public void publishMessage(String value) {

    ProducerRecord<String, String> message = new ProducerRecord<>(
        "topicName", value);

    myProducer.send(message, new MyCallback(message.key(), message.value()));
  }

  class MyCallback implements Callback {

    private final String key;
    private final String value;

    public MyCallback(String key, String value) {
      this.key = key;
      this.value = value;
    }

    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
      if (exception == null) {
        log.info("--------> All good !!");
      } else {
        // metadata may be null on failure, so log the key/value we kept,
        // which is exactly what you need to re-publish the message later
        log.info("--------> not so good  !!");
        log.info("failed key: " + key + ", value: " + value);
        log.info(exception.getMessage());
      }
    }
  }

}

If you analyze the number of "--------> not so good !!" logs per time unit, you can get the required insights.
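If you want an in-process counter instead of (or in addition to) log analysis, a minimal sketch could track failures atomically from inside the callback (the `FailureCounter` class is hypothetical, not part of the Kafka API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: counts failed publishes so the application can
// answer "how many messages have failed so far" at any point in time.
// AtomicLong makes it safe to call from the producer's I/O thread.
public class FailureCounter {

  private final AtomicLong failed = new AtomicLong();

  // Call this from onCompletion() when exception != null
  public void recordFailure() {
    failed.incrementAndGet();
  }

  public long failedSoFar() {
    return failed.get();
  }
}
```

Inside the callback above, the else branch would simply add a `failureCounter.recordFailure()` call alongside the logging.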

God speed !

Akhil Ghatiki
  • 1,140
  • 12
  • 29
4

Chris has already explained how to keep the system fault tolerant.

Kafka supports at-least-once message delivery semantics: if something goes wrong while sending a message, the producer will try to resend it.

When you create the Kafka producer properties, you can configure this behavior by setting the retries option to a value greater than 0.

 Properties props = new Properties();
 props.put("bootstrap.servers", "localhost:4242");
 props.put("acks", "all");
 props.put("retries", 3); // greater than 0, so retriable failures are retried
 props.put("batch.size", 16384);
 props.put("linger.ms", 1);
 props.put("buffer.memory", 33554432);
 props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
 props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

 Producer<String, String> producer = new KafkaProducer<>(props);

For more info check this.

Shankar
  • 8,529
  • 26
  • 90
  • 159
  • 2
    Thanks @Shankar. There are essentially two kinds of failures: retriable and non-retriable. The retries property helps only with retriable failures, for example a broker error while the leader went down and ZooKeeper is busy assigning a new leader; such failures are retriable and the above property will work. But for a non-retriable failure, no matter how high we set that property, it's not going to work. Thanks for the input though! – User5817351 Oct 26 '16 at 19:00
  • @Coder : Thanks for the inputs.. could you please let me know what are those non-retriable failures? – Shankar Oct 27 '16 at 01:25