I have a stateless Kafka Streams application that consumes from a topic and publishes to a different queue (Cloud PubSub) within a forEach. The topology does not end by producing into a new Kafka topic.
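For clarity, a simplified sketch of what I mean (the topic name and the `publishToPubSub` helper are placeholders, not my actual code):

```java
// Simplified sketch of the topology described above; "input-topic" and
// publishToPubSub(...) are placeholders.
StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
       .foreach((key, value) -> publishToPubSub(key, value)); // side effect: publish to Cloud PubSub
```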

How do I know which delivery semantics I can guarantee? Given that it's just a message forwarder and no deserialisation or any other transformation is applied: are there any cases in which I could have duplicates or missed messages?

I'm thinking about the following scenarios and their impact on how offsets are committed:

  • Sudden application crash
  • Error occurring on publish

Thanks guys

user2274307

1 Answer


If you consider the Kafka-to-Kafka loop that a Kafka Streams application usually creates, setting the property:

processing.guarantee=exactly_once

is enough to get exactly-once semantics, even in failure scenarios.
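For reference, a minimal sketch of setting that property programmatically (application id and bootstrap servers are placeholders; on Kafka Streams 2.x the constant is `StreamsConfig.EXACTLY_ONCE`, i.e. the string `"exactly_once"`):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Sketch: enabling EOS for a Kafka Streams application.
// Application id and broker address are placeholders.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pubsub-forwarder");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
```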

Under the hood, Kafka uses transactions to guarantee that the consume-process-produce-commit-offsets cycle is executed with an all-or-nothing guarantee.
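To illustrate, here is a rough sketch of that cycle with a plain consumer and a transactional producer, along the lines of the snippet in the Confluent transactions blog post. Topic names, group id and transactional.id are placeholders, and `consumer.groupMetadata()` assumes kafka-clients 2.5+ (older clients pass the group id string instead):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalForwarder {
    public static void main(String[] args) {
        // Placeholder configuration: brokers, group id, transactional.id and topics are illustrative.
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "forwarder-group");
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "forwarder-tx-1");

        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(cProps, new StringDeserializer(), new StringDeserializer());
             KafkaProducer<String, String> producer =
                 new KafkaProducer<>(pProps, new StringSerializer(), new StringSerializer())) {

            consumer.subscribe(Collections.singletonList("input-topic"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // Offsets are committed inside the same transaction, so the produced
                // records and the offset commit become visible all-or-nothing.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}
```

If the application crashes anywhere in the loop, the open transaction is aborted and both the output records and the offset commit are discarded together.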

Writing a Kafka-to-Google-PubSub sink connector with exactly-once semantics would mean solving the same issues Kafka already solves for the Kafka-to-Kafka scenario, namely (quoting the Confluent blog post on Kafka transactions, linked in the comments below):

  1. The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
  2. We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
  3. Finally, in distributed environments, applications will crash or, worse, temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”

Assuming your producer logic towards Cloud PubSub does not suffer from problem 1 (just as Kafka producers do not when enable.idempotence=true is used), you are still left with problems 2 and 3.
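As a side note, on the Kafka side that idempotence is just configuration; a minimal sketch (the broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

// Sketch: idempotent Kafka producer configuration, which addresses problem 1
// for Kafka-to-Kafka writes. Broker address is a placeholder.
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
producerProps.put(ProducerConfig.ACKS_CONFIG, "all"); // required when idempotence is enabled
```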

Without solving these issues, your processing semantics will be whatever delivery semantics your consumer provides: at least once, if you choose to commit the offsets manually and only after the publish has succeeded.
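To make that concrete, here is a hedged sketch of such an at-least-once forwarder using a plain consumer and the Google Cloud Pub/Sub client: publish synchronously, then commit. Project, topic and group names are placeholders and error handling is omitted:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOncePubSubForwarder {
    public static void main(String[] args) throws Exception {
        // Placeholder configuration: brokers, group id, GCP project and topics are illustrative.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pubsub-forwarder");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually, after publishing

        Publisher publisher =
            Publisher.newBuilder(TopicName.of("my-gcp-project", "my-pubsub-topic")).build();

        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer())) {
            consumer.subscribe(Collections.singletonList("input-topic"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    PubsubMessage msg = PubsubMessage.newBuilder()
                            .setData(ByteString.copyFromUtf8(record.value()))
                            .build();
                    // Block until Pub/Sub acknowledges the message.
                    publisher.publish(msg).get();
                }
                // Offsets advance only after every record in the batch was published.
                consumer.commitSync();
            }
        }
    }
}
```

If the process crashes between the publish and `commitSync()`, the batch is re-consumed and re-published on restart, so you can get duplicates in PubSub but no loss, which is exactly at-least-once.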

GionJh
  • Thank you for your answer, I'll accept it because it answers the question. I have a few follow-up questions though: if I let Kafka Streams commit offsets automatically, I won't even be guaranteed `at least once` semantics, is that right? Which means that I must commit my offsets myself to achieve the same guarantee the Kafka Connect PubSub Sink Connector has (`at least once` per its documentation). – user2274307 May 13 '20 at 10:59
  • Just saw that Kafka Streams indeed does not allow handling offsets manually. My pipeline is the simplest possible: _consume and immediately produce into PubSub_, so my naive approach is to **let it crash** when the publish operation fails, to avoid committing offsets. I guess this is not a good approach. – user2274307 May 13 '20 at 11:15
  • You can look for a sink connector that may already be available: https://www.confluent.io/hub/?utm_medium=sem&utm_source=google&utm_campaign=ch.sem_br.nonbrand_tp.prs_tgt.kafka_mt.mbm_rgn.emea_lng.eng_dv.all&utm_term=%2Bkafka%20%2Bsink&creative=&device=c&placement=&gclid=EAIaIQobChMIvqDtzt2w6QIVx-N3Ch2ZngHIEAAYASAAEgJtU_D_BwE – GionJh May 13 '20 at 11:27
  • The alternative is to write your own, and indeed for this case you could just think about chaining a consumer and a producer, a bit like the code snippet you can see here: https://www.confluent.io/blog/transactions-apache-kafka/ – GionJh May 13 '20 at 11:27
  • I won't write my own connector; rather, I'll either accept the risk of weaker consistency in my Kafka Streams application or opt for the existing Kafka Connect connector. – user2274307 May 13 '20 at 11:41
  • "exaclty_once" processing guaranteed only cover read-process-write from and to Kafka topics. Writing to an external system is a side-effect and not covered! -- To guarantee at-least-once processing you must ensure that your external write is sync. – Matthias J. Sax May 17 '20 at 22:38
  • @MatthiasJ.Sax I re-read my answer; I did not say that, I was trying to explain what the challenges are to achieve that. – GionJh May 18 '20 at 12:16
  • Well, you write `is enough to get exactly-once semantics, even in failure scenarios` -- this is misleading. – Matthias J. Sax May 18 '20 at 18:13