I was trying to understand Kafka's transactional API. This link defines atomic read-process-write cycle as follows:
First, let’s consider what an atomic read-process-write cycle means. In a nutshell, it means that if an application consumes a message A at offset X of some topic-partition tp0, and writes message B to topic-partition tp1 after doing some processing on message A such that B = F(A), then the read-process-write cycle is atomic only if messages A and B are considered successfully consumed and published together, or not at all.
It further says says following:
Using vanilla Kafka producers and consumers configured for at-least-once delivery semantics, a stream processing application could lose exactly once processing semantics in the following ways:
The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
Finally, in distributed environments, applications will crash or—worse!—temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”
We designed transaction APIs in Kafka to solve the second and third problems. Transactions enable exactly-once processing in read-process-write cycles by making these cycles atomic and by facilitating zombie fencing.
Doubts:
Points 2 and 3 above describe when message duplication can occur which are dealt with using transactional API. Does transactional API also help to avoid message loss in any scenario?
Most online (for example, here and here) examples of Kafka transactional API involve:
while (true) { ConsumerRecords records = consumer.poll(Long.MAX_VALUE); producer.beginTransaction(); for (ConsumerRecord record : records) producer.send(producerRecord(“outputTopic”, record)); producer.sendOffsetsToTransaction(currentOffsets(consumer), group); producer.commitTransaction(); }
This is basically read-process-write loop. So does transactional API useful only in read-process-write loop?
This article gives example of transactional API in non read-process-write scenario:
producer.initTransactions(); try { producer.beginTransaction(); producer.send(record1); producer.send(record2); producer.commitTransaction(); } catch(ProducerFencedException e) { producer.close(); } catch(KafkaException e) { producer.abortTransaction(); }
It says:
This allows a producer to send a batch of messages to multiple partitions such that either all messages in the batch are eventually visible to any consumer or none are ever visible to consumers.
Is this example correct and shows another way to use transactional API different from read-process-write loop? (Note that it also does not commit offset to transaction.)
In my application, I simply consume messages from kafka, do processing and log them to the database. That is my whole pipeline.
a. So, I guess this is not read-process-write cycle. Is Kafka transactional API of any use to my scenario?
b. Also I need to ensure that each message is processed exactly once. I guess setting
idempotent=true
in producer will suffice and I dont need transactional API, right?c. I may run multiple instances of pipeline, but I am not writing processing output to Kafka. So I guess this will never involve zombies (duplicate producers writing to kafka). So, I guess transactional API wont help me to avoid duplicate processing scenario, right? (I might have to persist both offset along with processing output to the database in the same database transaction and read the offset during producer restart to avoid duplicate processing.)