
Using KSQL and performing a left outer join, I sometimes see the result of my join emitted more than once.

In other words, the same join result is emitted more than once. I am not talking about one version of the join with a null value on the right side and another version without the null value. Literally the same record resulting from a join is emitted more than once.

I wonder if that is expected behaviour.

Matthias J. Sax
MaatDeamon

1 Answer


The general answer is yes. Kafka is an at-least-once system. More specifically, a few scenarios can result in duplication:

  1. Consumers only checkpoint their positions periodically. A consumer crash can result in duplicate processing of some range of records.
  2. Producers have client-side timeouts. This means the producer may think a request timed out and retransmit it while, broker-side, it actually succeeded.
  3. If you mirror data between Kafka clusters, that's usually done with a producer + consumer pair of some sort, which can lead to more duplication.
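To make scenario 1 concrete, here is a minimal pure-Python simulation (no Kafka client involved). The function name, the record names, and the crash point are all hypothetical; it only sketches how a periodic checkpoint plus a restart replays records that had already been processed.

```python
def consume_with_periodic_commit(records, commit_every, crash_at):
    """Process records, committing the position only every `commit_every` records.

    Simulates a single crash just before the record at index `crash_at` is
    processed, followed by a restart from the last committed position.
    """
    processed = []   # everything emitted downstream, duplicates included
    committed = 0    # last checkpointed position (next offset to read)
    offset = 0
    crashed = False
    while offset < len(records):
        if not crashed and offset == crash_at:
            crashed = True
            offset = committed  # restart: resume from the last checkpoint
            continue
        processed.append(records[offset])
        offset += 1
        if offset % commit_every == 0:
            committed = offset  # periodic checkpoint
    return processed


records = ["r0", "r1", "r2", "r3", "r4", "r5"]
out = consume_with_periodic_commit(records, commit_every=4, crash_at=3)
print(out)
```

In this run the crash happens before the first checkpoint, so r0, r1 and r2 are processed twice. Committing more often shrinks that replayed range but does not eliminate it.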

Are you seeing any such crashes/timeouts in your logs?

There are a few Kafka features you could try using to reduce the likelihood of this happening to you:

  1. Set enable.idempotence to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs) - incurs some overhead.
  2. Use transactions when producing - incurs overhead and adds latency.
  3. Set transactional.id on the producer in case you fail over across machines - gets complicated to manage at scale.
  4. Set isolation.level to read_committed on the consumer - adds latency (needs to be done in combination with 2 above).
  5. Shorten auto.commit.interval.ms on the consumer - just reduces the window of duplication, doesn't really solve anything. Incurs overhead at really low values.
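As a rough sketch, the settings listed above could be collected into client configuration dictionaries like the following. The property names are the ones from the list; the broker address, group id, transactional id, and the 1000 ms interval are placeholder assumptions, and the dictionaries are plain Python with no client library imported. Clients that accept Kafka property names as a dict (for example confluent-kafka) should be able to take them, but check your client's documentation.

```python
# Producer side: idempotence plus a transactional id (items 1-3).
producer_config = {
    "bootstrap.servers": "localhost:9092",    # assumption: local broker
    "enable.idempotence": True,               # item 1
    "transactional.id": "my-app-producer-1",  # item 3, hypothetical id
}

# Consumer side: only read committed transactional data (item 4)
# and checkpoint more often than the 5000 ms default (item 5).
consumer_config = {
    "bootstrap.servers": "localhost:9092",    # assumption: local broker
    "group.id": "my-app",                     # hypothetical consumer group
    "isolation.level": "read_committed",      # item 4
    "enable.auto.commit": True,
    "auto.commit.interval.ms": 1000,          # item 5, illustrative value
}

print(producer_config)
print(consumer_config)
```

Setting transactional.id only configures the producer; item 2 still requires the producing code to actually begin and commit transactions.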
radai
  • How come setting “exactly_once” as the processing guarantee does not solve it? – MaatDeamon Sep 12 '19 at 17:48
  • "ksql.streams.processing.guarantee": "exactly_once" – MaatDeamon Sep 13 '19 at 01:02
  • https://stackoverflow.com/questions/57878221/does-ksql-support-kafka-stream-processing-guarantees/57899003#57899003 – MaatDeamon Sep 13 '19 at 01:02
  • @MaatDeamon - there's no magic, only overhead. That setting will (if I read the docs correctly) simply set all the configurations I provided above for all producers and consumers that are under ksql's control. Note that even in the official docs - https://docs.ksqldb.io/en/latest/concepts/processing-guarantees/#exactly-once-semantics - you are warned that to truly get exactly-once, even the upstream/downstream clients outside of ksql's control need to be properly configured. And you should really measure the performance impact vs. the damage of dups. – radai Sep 20 '20 at 00:16