
I am trying to understand how offset and group management work with the Google Dataflow runner and the KafkaIO reader. More specifically, I am trying to understand how offset management works in two cases:

  • If the group.id config is set, and both auto-commit and commitOffsetsInFinalize are disabled (a sketch of this configuration is included below), how are offsets managed?
  • If the group.id config is not set, how do offset and group management work?
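
For concreteness, here is a minimal sketch of the read I am asking about (Beam Java SDK; the broker address, topic name, and group id are placeholders): group.id is set, enable.auto.commit is false, and commitOffsetsInFinalize() is not called.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KafkaReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Case 1: group.id is set, auto-commit is disabled, and
        // commitOffsetsInFinalize() is intentionally NOT called.
        Map<String, Object> consumerConfig = new HashMap<>();
        consumerConfig.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");       // placeholder group id
        consumerConfig.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        p.apply(
            "ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker-1:9092")  // placeholder broker
                .withTopic("my-topic")                  // placeholder topic
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withConsumerConfigUpdates(consumerConfig));

        p.run();
      }
    }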

Any code/document reference pointing in the right direction is appreciated.

Viraj

1 Answer


The KafkaIO reader is entirely part of Apache Beam; Google Cloud Dataflow does not treat this source any differently from any other Beam source.

You can find its code at https://github.com/apache/beam/tree/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka (in various files). I don't know of any reference documentation other than the Javadoc.

Kenn Knowles
  • If you check the documentation at https://beam.apache.org/releases/javadoc/2.27.0/org/apache/beam/sdk/io/kafka/KafkaIO.html, the "Partition Assignment and Checkpointing" section says: "In summary, KafkaIO.read follows below sequence to set initial offset: 1. KafkaCheckpointMark provided by runner." This suggests to me that the runner can maintain internal state with checkpoint values. – Viraj Jan 12 '21 at 11:06
  • The checkpoint mark is provided by the runner when it is resuming reading. – Kenn Knowles Jan 12 '21 at 21:54
  • In that case, how is the runner maintaining the previous checkpoint values when resuming reading from the same topic? – Viraj Jan 19 '21 at 17:05
  • The checkpoint mark is produced by KafkaIO and serialized. The runner's job is to store it and then provide it again later when KafkaIO is resumed. – Kenn Knowles Feb 17 '21 at 23:26
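
To make the contract described in the comments concrete, here is a minimal sketch (not actual Dataflow runner code; the class and method structure around the Beam UnboundedSource API is illustrative) of how a runner could round-trip a checkpoint mark: ask the reader for a mark, serialize it with the coder the source provides, store the bytes, and later decode them and pass the mark back to createReader when resuming.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.beam.sdk.io.UnboundedSource;
    import org.apache.beam.sdk.io.UnboundedSource.CheckpointMark;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Illustrative only: shows the UnboundedSource checkpoint contract that
    // KafkaIO implements. A real runner stores the bytes durably and manages
    // many readers in parallel.
    public class CheckpointRoundTripSketch {

      // Read for a while, then snapshot the source's checkpoint mark and
      // serialize it with the coder the source provides.
      static <T, C extends CheckpointMark> byte[] checkpoint(
          UnboundedSource<T, C> source, PipelineOptions options) throws IOException {
        UnboundedSource.UnboundedReader<T> reader = source.createReader(options, null);
        reader.start();
        // ... consume records via reader.getCurrent() / reader.advance() ...

        @SuppressWarnings("unchecked")
        C mark = (C) reader.getCheckpointMark();  // e.g. a KafkaCheckpointMark with per-partition offsets
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        source.getCheckpointMarkCoder().encode(mark, bytes);  // the runner persists these bytes

        // After the checkpoint is durably committed, the runner finalizes it;
        // this callback is the hook commitOffsetsInFinalize() relies on.
        mark.finalizeCheckpoint();
        reader.close();
        return bytes.toByteArray();
      }

      // Later, when resuming: decode the stored mark and hand it back to the
      // source, which seeks to the recorded offsets.
      static <T, C extends CheckpointMark> UnboundedSource.UnboundedReader<T> resume(
          UnboundedSource<T, C> source, PipelineOptions options, byte[] stored) throws IOException {
        C mark = source.getCheckpointMarkCoder().decode(new ByteArrayInputStream(stored));
        return source.createReader(options, mark);
      }
    }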