
I'm using KafkaIO in Dataflow to read messages from one topic, with the following code:

KafkaIO.<String, String>read()
        .withReadCommitted()
        .withBootstrapServers(endPoint)
        .withConsumerConfigUpdates(new ImmutableMap.Builder<String, Object>()
                .put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
                .put(ConsumerConfig.GROUP_ID_CONFIG, groupName)
                .put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 8000)
                .put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 2000)
                .build())
//      .commitOffsetsInFinalize()
        .withTopics(Collections.singletonList(topicNames))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata();

I run the Dataflow program locally using the direct runner, and everything works fine. Then I start another instance of the same program in parallel, i.e. another consumer. Now I see duplicate messages being processed by the pipeline.

Since I have provided a consumer group id, shouldn't a second consumer started with the same group id (a different instance of the same program) avoid processing the elements already handled by the first consumer?

How does this behave with the Dataflow runner?

bigbounty

1 Answer


I don't think the options you have set guarantee non-duplicate delivery of messages across pipelines.

  • ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG: this is a flag for the Kafka consumer, not for the Beam pipeline itself. Auto-commit is best-effort and periodic, so you may still see duplicates across multiple pipelines.

  • withReadCommitted(): this only means that Beam will not read messages that are uncommitted (in the transactional sense). Again, it does not prevent duplicates across multiple pipelines.
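If the goal is to have Beam commit offsets itself rather than relying on the consumer's auto-commit, KafkaIO offers commitOffsetsInFinalize() (the line commented out in the question). A minimal sketch, reusing the endPoint, groupName, and topicNames variables from the question; note that even with this, two independently launched pipelines will each read all partitions, since KafkaIO assigns partitions to its own splits rather than joining Kafka's consumer-group rebalancing:

```java
import java.util.Collections;
import java.util.Map;

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: let Beam commit offsets when each bundle is finalized,
// instead of the consumer's periodic auto-commit.
KafkaIO.<String, String>read()
        .withBootstrapServers(endPoint)
        .withTopics(Collections.singletonList(topicNames))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(Map.of(
                ConsumerConfig.GROUP_ID_CONFIG, groupName,
                // auto-commit must be disabled when Beam commits offsets itself
                ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false))
        // commit the consumed offsets back to Kafka on bundle finalization
        .commitOffsetsInFinalize()
        .withoutMetadata();
```

This reduces reprocessing on pipeline restart, but it is not an exactly-once guarantee and does not coordinate work between separate pipelines.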

See here for the protocol Beam sources use to determine the starting point of the Kafka source.

To guarantee non-duplicate delivery, you probably have to read from different topics or different subscriptions.
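As a sketch of that suggestion, each pipeline instance could be pointed at its own topic so the instances never overlap. The instanceId variable and the topic-naming scheme below are hypothetical, for illustration only:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;

// Hypothetical topic-per-instance layout: something upstream fans messages
// out across per-instance topics, so no two pipelines read the same data.
String topicForThisInstance = "orders-instance-" + instanceId;  // hypothetical naming

KafkaIO.<String, String>read()
        .withBootstrapServers(endPoint)
        .withTopic(topicForThisInstance)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata();
```

The trade-off is that partitioning of work moves from Kafka's consumer groups to however you route messages into the per-instance topics upstream.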

chamikara
  • I am facing this issue too. I want to run multiple instances of a consumer reading from the same topic, but each message is delivered to all running instances. After debugging the configuration I found that the group name gets a unique prefix appended, so each instance ends up with a unique group name, e.g. group.id = Reader-0_offset_consumer_559337182_my_group and group.id = Reader-0_offset_consumer_559337345_my_group. Since each instance is assigned a unique group.id, messages are delivered to all instances. – Aditya Jul 20 '20 at 17:43
  • @Aditya I am using a Dataflow pipeline, and in my case group.id is the same for both consumers, yet messages are still processed by both. Have you found a solution? – Ankit Adlakha Aug 09 '21 at 06:00