
I am trying to understand how offset and group management work with the Google Dataflow runner and the KafkaIO reader. More specifically, I am trying to understand how offset management works in two cases:

  • If the group.id config is set, and both auto-commit and commitOffsetsInFinalize are disabled (a sketch of this configuration is included below), how are offsets managed?
  • If the group.id config is not set, how do offset and group management work?
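
For concreteness, here is a minimal sketch of the read I am asking about (Beam Java SDK; the broker address, topic name, and group id are placeholders): group.id is set, enable.auto.commit is false, and commitOffsetsInFinalize() is not called.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KafkaReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Case 1: group.id is set, auto-commit is disabled, and
        // commitOffsetsInFinalize() is intentionally NOT called.
        Map<String, Object> consumerConfig = new HashMap<>();
        consumerConfig.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");       // placeholder group id
        consumerConfig.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        p.apply(
            "ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker-1:9092")  // placeholder broker
                .withTopic("my-topic")                  // placeholder topic
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withConsumerConfigUpdates(consumerConfig));

        p.run();
      }
    }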

Any code/document reference pointing in the right direction is appreciated.

Viraj

1 Answer


The KafkaIO reader is entirely part of Apache Beam; Google Cloud Dataflow does not treat this source any differently from any other Beam source.

You can find its code at https://github.com/apache/beam/tree/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka (in various files). I don't know of any reference documentation other than the Javadoc.

Kenn Knowles
  • If you check the documentation at https://beam.apache.org/releases/javadoc/2.27.0/org/apache/beam/sdk/io/kafka/KafkaIO.html, the "Partition Assignment and Checkpointing" section says: "In summary, KafkaIO.read follows below sequence to set initial offset: 1. KafkaCheckpointMark provided by runner." This suggests to me that the runner can maintain internal state with checkpoint values. – Viraj Jan 12 '21 at 11:06
  • The checkpoint mark is provided by the runner when it is resuming reading. – Kenn Knowles Jan 12 '21 at 21:54
  • In that case, how is the runner maintaining the previous checkpoint values when resuming reading from the same topic? – Viraj Jan 19 '21 at 17:05
  • The checkpoint mark is produced by KafkaIO and serialized. The runner's job is to store it and then provide it again later when KafkaIO is resumed. – Kenn Knowles Feb 17 '21 at 23:26
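
To make the contract described in the comments concrete, here is a minimal sketch (not actual Dataflow runner code; the class and method structure around the Beam UnboundedSource API is illustrative) of how a runner could round-trip a checkpoint mark: ask the reader for a mark, serialize it with the coder the source provides, store the bytes, and later decode them and pass the mark back to createReader when resuming.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.beam.sdk.io.UnboundedSource;
    import org.apache.beam.sdk.io.UnboundedSource.CheckpointMark;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Illustrative only: shows the UnboundedSource checkpoint contract that
    // KafkaIO implements. A real runner stores the bytes durably and manages
    // many readers in parallel.
    public class CheckpointRoundTripSketch {

      // Read for a while, then snapshot the source's checkpoint mark and
      // serialize it with the coder the source provides.
      static <T, C extends CheckpointMark> byte[] checkpoint(
          UnboundedSource<T, C> source, PipelineOptions options) throws IOException {
        UnboundedSource.UnboundedReader<T> reader = source.createReader(options, null);
        reader.start();
        // ... consume records via reader.getCurrent() / reader.advance() ...

        @SuppressWarnings("unchecked")
        C mark = (C) reader.getCheckpointMark();  // e.g. a KafkaCheckpointMark with per-partition offsets
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        source.getCheckpointMarkCoder().encode(mark, bytes);  // the runner persists these bytes

        // After the checkpoint is durably committed, the runner finalizes it;
        // this callback is the hook commitOffsetsInFinalize() relies on.
        mark.finalizeCheckpoint();
        reader.close();
        return bytes.toByteArray();
      }

      // Later, when resuming: decode the stored mark and hand it back to the
      // source, which seeks to the recorded offsets.
      static <T, C extends CheckpointMark> UnboundedSource.UnboundedReader<T> resume(
          UnboundedSource<T, C> source, PipelineOptions options, byte[] stored) throws IOException {
        C mark = source.getCheckpointMarkCoder().decode(new ByteArrayInputStream(stored));
        return source.createReader(options, mark);
      }
    }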