9

Should we ever invoke processorContext.commit() ourselves in a Processor implementation? I mean invoking the commit method inside a scheduled Punctuator implementation or inside the process method.

In which use cases should we do that, and do we need it at all? The question relates to both the Kafka Streams DSL with transform() and the Processor API.

It seems Kafka Streams handles committing by itself; also, invoking processorContext.commit() does not guarantee that the commit will happen immediately.
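For concreteness, the two call sites I have in mind look roughly like this (the class is just a made-up illustration; the schedule() overload taking a Duration requires Kafka 2.1+):

```java
import java.time.Duration;

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

// Made-up processor, only to show the two call sites the question is about.
public class CommitIllustrationProcessor extends AbstractProcessor<String, String> {

    @Override
    public void init(final ProcessorContext context) {
        super.init(context);
        // Call site 1: inside a scheduled Punctuator.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME,
                timestamp -> context.commit());
    }

    @Override
    public void process(final String key, final String value) {
        // ... business logic ...
        // Call site 2: directly inside process().
        context().commit();
    }
}
```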

Vasyl Sarzhynskyi

2 Answers

9

It is ok to call commit() -- either from the Processor or from a Punctuator -- that's why this API is offered.

While Kafka Streams commits at a regular (configurable) interval, you can request intermediate commits when you need them. One example use case: you usually do cheap computation, but sometimes you do something expensive and want to commit as soon as possible after this operation instead of waiting for the next commit interval (to reduce the likelihood of a failure between the expensive operation and the next commit). Another use case: you set the commit interval to MAX_VALUE, which effectively "disables" regular commits, and decide when to commit based on your business logic.
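For illustration, a minimal sketch of the first use case (the class name and the expensive-operation helpers are made up; the commit is still only a request, not a synchronous commit):

```java
import org.apache.kafka.streams.processor.AbstractProcessor;

// Sketch: most records are cheap, some trigger an expensive side effect.
public class ExpensiveStepProcessor extends AbstractProcessor<String, String> {

    @Override
    public void process(final String key, final String value) {
        if (isExpensive(value)) {
            runExpensiveOperation(value);
            // Request a commit soon, so a crash shortly after the expensive
            // step is less likely to force redoing it on restart.
            context().commit();
        }
        // Cheap records simply rely on the regular commit interval.
    }

    private boolean isExpensive(final String value) {
        return value != null && value.length() > 1_000;  // arbitrary placeholder check
    }

    private void runExpensiveOperation(final String value) {
        // e.g., call a slow external system (placeholder)
    }
}
```

For the second use case, the regular interval can be pushed out via StreamsConfig.COMMIT_INTERVAL_MS_CONFIG (for example, setting it to Long.MAX_VALUE), so that commits effectively only happen when you request them.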

I guess that calling commit() is not necessary for most use cases, though.

Matthias J. Sax
  • Are there any guarantees around manually calling commit though? – Vikas Tikoo Jan 08 '19 at 21:26
  • afaict from the code, a processor's commit request will be processed at the end of the current batch being processed by the StreamThread. – Vikas Tikoo Jan 08 '19 at 21:29
  • There are no guarantees -- it's treated as a request, and how quickly the commit happens is an implementation detail that may change without notice. Your observation is correct: in the current implementation we check the flag after processing a certain number of records (not after each record). We do this to keep runtime overhead low and to increase system throughput. – Matthias J. Sax Jan 08 '19 at 21:49
  • Agree with you on it being an implementation detail not surfaced in the API. My concern was whether using commit without guarantees is a good pattern. Might a simpler Kafka consumer be better suited for a use case where an expensive computation occurs per consumed message? – Vikas Tikoo Jan 08 '19 at 21:53
  • 2
    Guess it's a use-case dependent decision. The "problem" for Kafka Streams is that `commit()` could be called even if the record has been processed only partially (ie, not by all operators of a task), and thus the commit cannot be executed synchronously when requested. The earliest point would be after the record is fully processed. And because there is already a delay, adding some more delay to get higher throughput seems to be a good trade-off. – Matthias J. Sax Jan 08 '19 at 22:00
  • that makes sense. thanks for explaining the rationale! – Vikas Tikoo Jan 08 '19 at 22:20
0

For my use case, I am batching a certain number of records in the processor's process method and writing the batched records to a file from the process function once the batch size reaches a certain number (let's say 10).

Let's say we write one batch of records to the file and the system crashes before the commit happens (since we can't force an immediate commit). The next time the stream starts, the processor processes the records from the last committed offset. This means we could be writing some duplicate data to files. Is there any way to avoid writing duplicate data?
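For reference, a rough sketch of the batching I am describing (the class name and writeBatchToFile() are placeholders; per the answer above, commit() is only a request, so it narrows the duplicate window but does not remove it):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.processor.AbstractProcessor;

// Placeholder sketch of the batch-to-file processor described above.
public class FileBatchProcessor extends AbstractProcessor<String, String> {

    private static final int BATCH_SIZE = 10;
    private final List<String> batch = new ArrayList<>();

    @Override
    public void process(final String key, final String value) {
        batch.add(value);
        if (batch.size() >= BATCH_SIZE) {
            writeBatchToFile(batch);  // placeholder file writer
            batch.clear();
            // A crash between the file write and the eventual commit still
            // replays (and re-writes) this batch after restart.
            context().commit();
        }
    }

    private void writeBatchToFile(final List<String> records) {
        // append the records to a file (placeholder)
    }
}
```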

  • Using a plain old kafka consumer could give you a smaller window of failure between the file write and commit. – Vikas Tikoo Jan 09 '19 at 18:24
  • Also you should consider writing this as a comment on the question, or a separate question in itself. – Vikas Tikoo Jan 09 '19 at 18:25
  • Thanks Vikas, suggestions taken. I didn't have permission to add a comment, so I posted it as an answer. I didn't want to create a new post, as I felt this fits the context of the original question. – user9656219 Jan 09 '19 at 18:28