
I'm working on a pipeline that reads messages from Kafka using KafkaIO, and I'm looking at the commitOffsetsInFinalize() option and the KafkaCheckpointMark class.

I want to achieve at-least-once delivery semantics and want to be sure that offsets are committed to Kafka only after the corresponding messages have been written to some sink.
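For context, a minimal sketch of the kind of pipeline I mean (the broker address, topic, consumer group and sink are placeholders, not my actual setup):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaAtLeastOncePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // commitOffsetsInFinalize() needs a consumer group to commit offsets to.
    Map<String, Object> consumerProps = new HashMap<>();
    consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group"); // placeholder

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")               // placeholder
        .withTopic("events")                               // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(consumerProps)
        // Commit offsets back to Kafka when the runner finalizes the
        // checkpoint covering the corresponding records.
        .commitOffsetsInFinalize()
        .withoutMetadata());
    // ... transforms and the write to the sink would follow here.

    p.run();
  }
}
```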

Looking at the CheckpointMark interface, it's not clear when finalization should be expected to happen.
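For reference, the interface (paraphrased here from org.apache.beam.sdk.io.UnboundedSource) is essentially a single callback, and nothing in the signature says when the runner will invoke it:

```java
import java.io.IOException;

public interface CheckpointMark {
  /**
   * Called to signal that this checkpoint mark has been committed, along with
   * the records read since the previous checkpoint -- but when that commit
   * happens is not specified here.
   */
  void finalizeCheckpoint() throws IOException;
}
```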

Is it runner-dependent? What should I expect when executing on DataflowRunner?

Reading the KafkaIO.Read javadoc on commitOffsetsInFinalize also doesn't bring clarity, particularly the phrase

But it does not provide hard processing guarantees

Question: What is the contract in the Beam model for when checkpoint marks are finalized, if there is one?

marknorkin
  • I've also opened an issue in the Apache Beam JIRA, https://jira.apache.org/jira/browse/BEAM-6902, for a documentation improvement. – marknorkin Mar 27 '19 at 07:03

2 Answers


Yes, that behaviour is runner-dependent. In the Dataflow runner, finalization in streaming pipelines happens once the data has been committed to Dataflow's internal state, i.e. when the entire bundle of elements has finished processing.

Based on the doc description, commitOffsetsInFinalize helps to reduce reprocessing, but it does not matter whether it is used: either way you will have at-least-once semantics in the Dataflow runner.
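For illustration, a rough sketch of the two consumer configurations being compared (broker, topic and group ids are placeholders, not from this thread); the only intended difference is how offsets get committed back to the consumer group:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

class OffsetCommitModes {

  // Offsets are committed to the consumer group only when the runner finalizes
  // the checkpoint, i.e. after the read results are committed to Dataflow's
  // internal state.
  static KafkaIO.Read<String, String> finalizeCommit() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group"); // placeholder
    return base().withConsumerConfigUpdates(props).commitOffsetsInFinalize();
  }

  // Kafka's own auto-commit: the consumer commits offsets on a timer,
  // independently of whether Beam has finished processing those records.
  static KafkaIO.Read<String, String> autoCommit() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group"); // placeholder
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
    return base().withConsumerConfigUpdates(props);
  }

  private static KafkaIO.Read<String, String> base() {
    return KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092") // placeholder
        .withTopic("events")                 // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class);
  }
}
```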

Alex Amato
  • Can you please elaborate on why it does not matter whether this option is used for Dataflow? Is there no difference in Dataflow between using Kafka's auto-commit and commitOffsetsInFinalize? – marknorkin Mar 31 '19 at 14:12
  • Would it be possible for you to point to the concrete code responsible for that behaviour, if it's open source? – marknorkin Mar 31 '19 at 14:13

When using the Dataflow runner, checkpoint finalization happens once the results of reading from the source have been durably committed to Dataflow's internal state. This guarantees exactly-once processing as long as you update or drain your pipelines, but not if you cancel a running pipeline. When commitOffsetsInFinalize is set to true, this causes Dataflow to commit the partition offsets back to Kafka at that point.

When commitOffsetsInFinalize is false, KafkaIO uses a different, more efficient way of reading from Kafka. In this mode, Dataflow (or another runner) stores, for each partition, the offset up to which it has read. There is no data-loss concern, because the data is not consumed from Kafka, and a new pipeline can specify exactly where in the Kafka stream to start reading.
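For illustration, a rough sketch (placeholder broker and topic names) of two ways a new pipeline can pick its own starting point when offsets are not being committed back to Kafka: the standard auto.offset.reset consumer setting, or KafkaIO's withStartReadTime:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Instant;

class StartPositionExamples {

  // Start from the earliest (or, with "latest", the newest) retained offset
  // using the standard Kafka consumer setting.
  static KafkaIO.Read<String, String> fromEarliest() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    return base().withConsumerConfigUpdates(props);
  }

  // Start from records whose timestamps are at or after a given instant.
  static KafkaIO.Read<String, String> fromTimestamp() {
    return base().withStartReadTime(Instant.parse("2019-04-01T00:00:00Z"));
  }

  private static KafkaIO.Read<String, String> base() {
    return KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092") // placeholder
        .withTopic("events")                 // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class);
  }
}
```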

danielm
  • Can you be more precise about what the different and more efficient way is? Do you mean enabling auto-commit? If so, how does that resolve the issue around 'no data loss'? – marknorkin Apr 02 '19 at 06:19
  • Also on this note, do you have any links or other materials that support your point about 'exactly once guarantees'? – marknorkin Apr 02 '19 at 06:20
  • When commitOffsetsInFinalize is false, instead of committing offsets to Kafka, the KafkaIO source will internally store the offsets it has read. This means the data in Kafka is not consumed, so when you start a new pipeline, you can start it as far back in the stream as you'd like. – danielm Apr 02 '19 at 21:19
  • Thank you for elaborating on this one, however I still can't understand how this guarantees exactly-once delivery in case of failure scenarios or simple pipeline updates. – marknorkin Apr 03 '19 at 06:30