1

I am running multiple instances of Apache Beam KafkaIO using the DirectRunner, all reading from the same topic, but every message is getting delivered to all running instances. After inspecting the Kafka configuration I found that the group name is getting appended with a unique prefix, so each instance ends up with a unique group name:

  1. group.id = Reader-0_offset_consumer_559337182_my_group
  2. group.id = Reader-0_offset_consumer_559337345_my_group

So each instance has a unique group.id assigned, and that is why messages are getting delivered to all instances.

pipeline.apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withReadCommitted()
            .withConsumerConfigUpdates(new ImmutableMap.Builder<String, Object>()
                    .put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
                    .put(ConsumerConfig.GROUP_ID_CONFIG, "my_group")
                    .put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 5)
                    .build())
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withBootstrapServers(servers)
            .withTopics(Collections.singletonList(topicName))
            .withoutMetadata());

So what configuration do I have to give so that the consumers in a group don't all read the same messages?

Aditya
  • What is a reason to run multiple instances of KafkaIO with DirectRunner and reading from the same topic? – Alexey Romanenko Jul 21 '20 at 13:07
  • @AlexeyRomanenko, we are not using GCP and are running on our own bare metal, so we can't use Dataflow. We want to scale by deploying in k8s pods and increasing the number of pods. But the problem I see is that since each instance gets assigned a unique groupId, whenever I send a message, the message goes to every group/instance. Hope this clarifies the problem – Aditya Jul 23 '20 at 14:13
  • I'd not recommend using DirectRunner in production for a significant amount of data, since this runner is meant mostly for testing; it performs many additional checks while the pipeline runs, which is why it can be quite slow compared to other runners. Would it be an option for you to use the Spark or Flink runners over distributed Spark or Flink clusters? – Alexey Romanenko Jul 23 '20 at 14:56
  • @AlexeyRomanenko No, as of now we don't have the option of using Spark or Flink. Also, please revert the negative vote as it is a valid scenario – Aditya Jul 27 '20 at 07:19
  • I didn't vote negatively but I gave +1 to your post. I expect that people can have different cases, I just recommend how it could be better to use. – Alexey Romanenko Jul 27 '20 at 13:41
  • @Aditya did you ever come up with a solution for this? I am having a somewhat similar situation. In my case I want the same groupId to remain after the restart in case if the Beam Job crashes. Any findings from your side would be appreciated. Thanks – user3693309 Dec 17 '20 at 23:19
  • @user3693309 We moved to dataflow. – Aditya Dec 19 '20 at 04:16
  • @Aditya I am using dataflow pipeline and in my case group.id is also same for both the consumers, still messages are getting processed by both the consumers. Have you got any solution? – Ankit Adlakha Aug 09 '21 at 06:01

1 Answer

0

Yes, this happens because the group name is getting appended with a unique prefix, so each instance has a unique group name. Because each instance sits in its own consumer group, Kafka doesn't rebalance partitions when you spin up one more instance; every group receives the full stream, so the same messages get delivered to all consumers.

Hence, one workaround I can think of: instead of giving just the topic and letting Beam figure out the consumers for all the partitions, you can explicitly assign topic partitions to each instance of the Apache Beam KafkaIO pipeline running on the DirectRunner.

You will have to pass a List of TopicPartition objects to the withTopicPartitions method.

KafkaIO.<String, String>read()
                .withCreateTime(Duration.standardMinutes(1))
                .withReadCommitted()
                .withBootstrapServers(endPoint)
                .withConsumerConfigUpdates(new ImmutableMap.Builder<String, Object>()
                        .put(ConsumerConfig.GROUP_ID_CONFIG, groupName)
                        .put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 5)
                        .put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
                        .build())
                .withTopicPartitions(Arrays.asList(new TopicPartition(topicName, 0)))
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata();

The above code will read messages only from partition 0 of the topic. This way you can spin up multiple instances of the same program, each assigned different partitions, without the same messages getting delivered to all consumers.
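To avoid hard-coding the partition in each deployment, you could derive each instance's partition list from an instance index (for example, the ordinal of a Kubernetes StatefulSet pod). The sketch below is an illustration, not part of KafkaIO: the class `PartitionAssigner` and its parameters are hypothetical names, and it assumes a fixed, known partition count and instance count. It distributes partitions round-robin; you would then map each integer to `new TopicPartition(topicName, p)` and pass the result to `withTopicPartitions`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: round-robin assignment of topic partitions to instances.
public class PartitionAssigner {

    // Returns the partition numbers this instance should read, given its
    // zero-based index (e.g. the StatefulSet pod ordinal), the total number
    // of instances, and the topic's partition count.
    public static List<Integer> partitionsFor(int instanceIndex, int totalInstances, int partitionCount) {
        List<Integer> owned = new ArrayList<>();
        for (int p = instanceIndex; p < partitionCount; p += totalInstances) {
            owned.add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        // With 6 partitions and 2 instances, instance 0 reads 0, 2, 4
        // and instance 1 reads 1, 3, 5 -- no overlap between instances.
        System.out.println(partitionsFor(0, 2, 6)); // [0, 2, 4]
        System.out.println(partitionsFor(1, 2, 6)); // [1, 3, 5]
    }
}
```

Note that with this static assignment there is no rebalancing: if you change the number of pods or the topic's partition count, you must redeploy all instances with the new values.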

bigbounty