I am trying to write a simple Beam pipeline that starts consuming from the earliest offset in each partition of a Kafka topic.
I have not been able to figure out how to make the pipeline read from the earliest possible offsets.
By default, KafkaConsumer instances start from the latest offset of each partition, which means they will only pick up messages published after the consumer starts.

If you want your pipeline to begin with the earliest available offsets instead, you can do that by calling the withStartReadTime method on KafkaIO.read():
p.apply(KafkaIO.<Long, String>read() // key/value types here are just an example
    .withBootstrapServers(KAFKA_BOOTSTRAP_SERVER)
    .withTopic(KAFKA_TOPIC)
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    // Reading from the epoch ensures the earliest available messages are consumed
    .withStartReadTime(Instant.EPOCH));
And that should do it!
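
Note that withStartReadTime resolves the starting position by asking the brokers for the offsets at that timestamp, so your Kafka setup needs to support offset lookup by time. If that doesn't fit your case, another option (assuming your Beam version has KafkaIO's withConsumerConfigUpdates) is a sketch along these lines, setting the consumer's auto.offset.reset property to earliest:

p.apply(KafkaIO.<Long, String>read()
    .withBootstrapServers(KAFKA_BOOTSTRAP_SERVER)
    .withTopic(KAFKA_TOPIC)
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    // "earliest" only applies when the reader has no stored offset for a partition;
    // otherwise it resumes from where it left off.
    .withConsumerConfigUpdates(
        Collections.<String, Object>singletonMap("auto.offset.reset", "earliest")));

The config-based approach mirrors how you would configure a plain KafkaConsumer, while withStartReadTime lets you start from an arbitrary point in time rather than only the very beginning.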