I am trying to write a simple Beam pipeline that starts consuming from the earliest offset in each partition of a Kafka topic.
I have not been able to figure out how to make the pipeline read from the earliest possible offsets.
By default, KafkaConsumer instances start from the latest offset of each partition, which means they will only pick up messages published after the consumer starts.

If you want your pipeline to begin with the earliest available offsets instead, you can do that by calling the withStartReadTime method on KafkaIO.read():
p.apply(KafkaIO.<Long, String>read() // key/value types here are just an example
    .withBootstrapServers(KAFKA_BOOTSTRAP_SERVER)
    .withTopic(KAFKA_TOPIC)
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    // Reading from the epoch ensures the earliest available messages are consumed
    .withStartReadTime(Instant.EPOCH));
And that should do it!
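
Note that withStartReadTime resolves the starting position by asking the brokers for the offsets at that timestamp, so your Kafka setup needs to support offset lookup by time. If that doesn't fit your case, another option (assuming your Beam version has KafkaIO's withConsumerConfigUpdates) is a sketch along these lines, setting the consumer's auto.offset.reset property to earliest:

p.apply(KafkaIO.<Long, String>read()
    .withBootstrapServers(KAFKA_BOOTSTRAP_SERVER)
    .withTopic(KAFKA_TOPIC)
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    // "earliest" only applies when the reader has no stored offset for a partition;
    // otherwise it resumes from where it left off.
    .withConsumerConfigUpdates(
        Collections.<String, Object>singletonMap("auto.offset.reset", "earliest")));

The config-based approach mirrors how you would configure a plain KafkaConsumer, while withStartReadTime lets you start from an arbitrary point in time rather than only the very beginning.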