
I have a Beam pipeline that consumes streaming events and processes them through multiple stages (PTransforms). See the following code:

    pipeline.apply("Read Data from Stream", StreamReader.read())
            .apply("Decode event and extract relevant fields", ParDo.of(new DecodeExtractFields()))
            .apply("Deduplicate process", ParDo.of(new Deduplication()))
            .apply("Conversion, Mapping and Persisting", ParDo.of(new DataTransformer()))
            .apply("Build Kafka Message", ParDo.of(new PrepareMessage()))
            .apply("Publish", ParDo.of(new PublishMessage()))
            .apply("Commit offset", ParDo.of(new CommitOffset()));

The streaming events are read using KafkaIO, and the StreamReader.read() method implementation looks like this:

    public static KafkaIO.Read<String, String> read() {
        return KafkaIO.<String, String>read()
                .withBootstrapServers(Constants.BOOTSTRAP_SERVER)
                .withTopics(Constants.KAFKA_TOPICS)
                .withConsumerConfigUpdates(Constants.CONSUMER_PROPERTIES)
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class);
    }

After we read a streamed event/message through KafkaIO, we can commit the offset. What I need to do is commit the offset manually, inside the last Commit offset PTransform, once all the previous PTransforms have executed.

The reason is that I am doing some conversions, mappings, and persisting in the middle of the pipeline, and I want to commit the offset only when all of that completes without failing. That way, if processing fails in the middle, I can consume the same record/event again and reprocess it.

My question is, how do I commit the offset manually? I would appreciate it if you could share resources/sample code.

  • I'm surprised you are taking this approach. What's wrong with the built-in support for exactly-once processing? – David Anderson Sep 16 '22 at 18:50
  • @David Anderson In my use case, as I mentioned above, I read the data through KafkaIO and do some transformations/processing. We could commit the offset as soon as we read a record, but imagine that after reading a record I commit the offset and then go to persist that data into a database. The database might not be available at that time. Since the data is not persisted, we no longer have that record on our side and cannot read it from Kafka again, because we have already committed the offset. That is why I am trying to commit the offset only after everything is done. – Prasad Sep 19 '22 at 04:03
  • Sure, but Flink and Beam are set up to handle all of that for you. You don't need to worry about these details so long as you configure the connectors properly for exactly-once behavior. https://www.docs.immerok.cloud/docs/cookbook/exactly-once-with-apache-kafka-and-apache-flink/ covers how to do this for Flink. I'm less familiar with the details for Beam, but there are resources for this as well, e.g., https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1. – David Anderson Sep 19 '22 at 12:57

1 Answer

Well, for sure, there is the Read.commitOffsetsInFinalize() method, which is supposed to commit offsets while finalizing the checkpoints, and the AUTO_COMMIT consumer config option, which is used to auto-commit read records by the Kafka consumer.
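
For reference, a minimal sketch of what the first option could look like on the read() method from the question (this reuses the same Constants placeholders; note that commitOffsetsInFinalize() requires a group.id to be present in the consumer config):

    public static KafkaIO.Read<String, String> read() {
        return KafkaIO.<String, String>read()
                .withBootstrapServers(Constants.BOOTSTRAP_SERVER)
                .withTopics(Constants.KAFKA_TOPICS)
                .withConsumerConfigUpdates(Constants.CONSUMER_PROPERTIES)
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                // Commit offsets back to Kafka when the runner finalizes a checkpoint,
                // not when your own downstream PTransforms have finished.
                .commitOffsetsInFinalize();
    }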

Though, in your case, it won't work, and you need to do it manually by grouping the offsets of the same topic/partition/window and creating a new instance of a Kafka client in your CommitOffset DoFn which will commit these offsets. You need to group the offsets by partition; otherwise, there may be a race condition when committing offsets of the same partition on different workers.
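
As an illustration only, here is a rough sketch of what such a CommitOffset DoFn could look like. It assumes that the upstream stages emit, per window, a KV of partition number to the highest processed offset (e.g. via a GroupByKey/Max step, not shown), that a single topic is involved (Constants.KAFKA_TOPIC below is a hypothetical name), and that Constants.CONSUMER_PROPERTIES contains the group.id whose offsets you want to commit:

    import java.util.Collections;
    import java.util.Map;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    // Hypothetical sketch: the input is KV<partition, highest offset processed in the window>.
    public class CommitOffset extends DoFn<KV<Integer, Long>, Void> {

        private transient KafkaConsumer<String, String> consumer;

        @Setup
        public void setup() {
            // A separate Kafka client, reusing the consumer properties (including group.id)
            // that the KafkaIO read was configured with.
            consumer = new KafkaConsumer<>(
                    Constants.CONSUMER_PROPERTIES, new StringDeserializer(), new StringDeserializer());
        }

        @ProcessElement
        public void processElement(@Element KV<Integer, Long> element) {
            TopicPartition partition =
                    new TopicPartition(Constants.KAFKA_TOPIC, element.getKey());
            // Kafka stores the offset of the *next* record to be read, hence the +1.
            Map<TopicPartition, OffsetAndMetadata> offsets = Collections.singletonMap(
                    partition, new OffsetAndMetadata(element.getValue() + 1));
            consumer.commitSync(offsets);
        }

        @Teardown
        public void teardown() {
            if (consumer != null) {
                consumer.close();
            }
        }
    }

Keep in mind that this only sketches the commit mechanics; the grouping by topic/partition/window described above still has to happen upstream of this DoFn.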

Alexey Romanenko