
This question came to mind because we are running Kafka Streams applications without EOS enabled, due to infra constraints. We are unsure of the behavior when doing custom logic with the Transformer/Processor API and changelogged state stores.

Say we are using the following topology to de-duplicate records before sending them downstream:

[topic] -> [flatTransformValues + state store] -> [...(downstream)]

The transformer here compares each incoming record against the state store and only forwards + updates the record when its value has changed, so for messages [A:1], [A:1], [A:2], we expect downstream to receive only [A:1] and [A:2].
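The dedup step described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: a plain `HashMap` stands in for the changelogged `KeyValueStore` that a real `Transformer`/`Processor` would obtain from its `ProcessorContext`, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Sketch of the de-duplication logic from the question. In a real topology,
// `store` would be a changelogged KeyValueStore and `process` would run inside
// a flatTransformValues transformer, forwarding zero or one record per input.
class Deduplicator {
    private final Map<String, String> store = new HashMap<>();

    // Forward the value only when it differs from the last stored value.
    List<String> process(String key, String value) {
        List<String> forwarded = new ArrayList<>();
        String previous = store.get(key);
        if (!Objects.equals(previous, value)) {
            store.put(key, value);   // update state (changelogged in Kafka Streams)
            forwarded.add(value);    // forward downstream
        }
        return forwarded;            // empty list = duplicate, record dropped
    }
}
```

For the sequence [A:1], [A:1], [A:2], the first and third calls forward the value, while the second returns an empty list. The failure question below is exactly about the gap between the `store.put` (persisted via the changelog) and the downstream forward.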

The question is: when a failure happens, is it possible that [A:2] gets stored in the state store's changelog while downstream does not receive the message, so that any retry reading [A:2] will discard the record and it is lost forever?

If not, please tell me what mechanism prevents this from happening. One way I think it could work is if Kafka Streams produced to the changelog topics and committed offsets only after the produce to downstream succeeds.

Much appreciated!

Alex

1 Answer


The question is: when a failure happens, is it possible that [A:2] gets stored in the state store's changelog while downstream does not receive the message, so that any retry reading [A:2] will discard the record and it is lost forever?

Yes, that's possible. At-least-once only guarantees that the event will be re-read and re-processed. In your case, however, the already-updated state would change the outcome of the second processing: the retried record would be detected as a duplicate and dropped.

In the end, it does not make much sense to write a de-duplication program without exactly-once guarantees anyway. Even if you could prevent the scenario you describe, at-least-once processing could introduce duplicates by itself...
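For reference, enabling the exactly-once guarantee the answer recommends is a single config change (assuming Kafka Streams 3.0+ and brokers that support EOS v2; the application id and bootstrap servers below are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Hypothetical config fragment: with exactly_once_v2, state-store updates,
// changelog writes, downstream produces, and offset commits are committed
// atomically per task, so the [A:2]-lost scenario cannot occur.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
```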

Matthias J. Sax
  • Thanks for the answer! However, in our use case the deduplication is only meant for performance reasons, and the downstream topology is idempotent. It's a bit awkward to see that Kafka Streams doesn't send records downstream before committing to the state store while the processing guarantee is stated as "at least once". – Alex Feb 26 '23 at 10:44
  • I know what you mean, but this is how KS is implemented. Store updates are applied directly, not "staged" (not to be confused with the caches KS has -- they serve a different purpose). -- There is no strict definition of "at least once", and KS ensures that each record is processed (i.e., read) at least once. – Matthias J. Sax Feb 28 '23 at 00:18