
I use Azure Event Hubs for Kafka with Spring Kafka 1.3.5 (for compatibility reasons) on the consumer side. Here is my config:

@EnableKafka
@Configuration
class EventHubsKafkaConfig(@Value("\${eventhubs.broker}") val eventHubsBroker: String,
                           @Value("\${eventhubs.new-mails.shared-access-key}") val newMailsEventHubSharedKey: String,
                           @Value("\${eventhubs.consumer-group}") val consumerGroup: String) {
    @Bean
    fun kafkaListenerContainerFactory(consumerFactory: ConsumerFactory<Int, NewMailEvent>):
            ConcurrentKafkaListenerContainerFactory<Int, NewMailEvent> {
        val factory = ConcurrentKafkaListenerContainerFactory<Int, NewMailEvent>()
        factory.consumerFactory = consumerFactory
        return factory
    }

    @Bean
    fun consumerFactory(consumerConfigs: Map<String, Any>) =
            DefaultKafkaConsumerFactory<Int, NewMailEvent>(consumerConfigs, IntegerDeserializer(),
                    JsonDeserializer(NewMailEvent::class.java, jacksonObjectMapper()))

    @Bean
    fun consumerConfigs(): Map<String, Any> {
        val connectionString = "Endpoint=sb://${eventHubsBroker}/;" +
                "SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=${newMailsEventHubSharedKey}"

        return mapOf(
                ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG to "${eventHubsBroker}:9093",
                ConsumerConfig.GROUP_ID_CONFIG to consumerGroup,
                CommonClientConfigs.SECURITY_PROTOCOL_CONFIG to "SASL_SSL",
                SaslConfigs.SASL_MECHANISM to "PLAIN",
                SaslConfigs.SASL_JAAS_CONFIG to "org.apache.kafka.common.security.plain.PlainLoginModule required " +
                        "username=\"\$ConnectionString\" password=\"$connectionString\";",
                ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG to IntegerDeserializer::class.java,
                ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG to JsonDeserializer::class.java
        )
    }
}

and the consumer component:

@Component
class NewMailEventConsumer {
    @KafkaListener(topics = ["\${eventhubs.new-mails.topic-name}"])
    fun newMails(newMailEvent: NewMailEvent) {
        logger.info { "new mail event: $newMailEvent" }
    }

    companion object : KLogging()
}

data class NewMailEvent(val mailbox: String, val mailUuid: String)

When I start 2 instances of the consumer app with this code, I see strange warnings that never end:

Successfully joined group offer-application-bff-local with generation 5
web_1  | 2018-07-09 11:20:42.950  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : Setting newly assigned partitions [offer-mail-crawler-new-mails-0] for group offer-application-bff-local
web_1  | 2018-07-09 11:20:42.983  INFO 1 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer    : partitions assigned:[offer-mail-crawler-new-mails-0]
web_1  | 2018-07-09 11:21:28.686  WARN 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : Auto-commit of offsets {offer-mail-crawler-new-mails-0=OffsetAndMetadata{offset=4, metadata=''}} failed for group offer-application-bff-local: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
web_1  | 2018-07-09 11:21:28.687  WARN 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : Auto-commit of offsets {offer-mail-crawler-new-mails-0=OffsetAndMetadata{offset=4, metadata=''}} failed for group offer-application-bff-local: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
web_1  | 2018-07-09 11:21:28.687  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : Revoking previously assigned partitions [offer-mail-crawler-new-mails-0] for group offer-application-bff-local
web_1  | 2018-07-09 11:21:28.687  INFO 1 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer    : partitions revoked:[offer-mail-crawler-new-mails-0]
web_1  | 2018-07-09 11:21:28.688  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.AbstractCoordinator  : (Re-)joining group offer-application-bff-local
web_1  | 2018-07-09 11:21:29.670  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.AbstractCoordinator  : Marking the coordinator bap-event-hubs-dev.servicebus.windows.net:9093 (id: 2147483647 rack: null) dead for group offer-application-bff-local
web_1  | 2018-07-09 11:21:43.099  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.AbstractCoordinator  : Discovered coordinator bap-event-hubs-dev.servicebus.windows.net:9093 (id: 2147483647 rack: null) for group offer-application-bff-local.
web_1  | 2018-07-09 11:21:43.131  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.AbstractCoordinator  : (Re-)joining group offer-application-bff-local
web_1  | 2018-07-09 11:21:43.344  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.AbstractCoordinator  : Successfully joined group offer-application-bff-local with generation 7
web_1  | 2018-07-09 11:21:43.345  INFO 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : Setting newly assigned partitions [offer-mail-crawler-new-mails-0] for group offer-application-bff-local
web_1  | 2018-07-09 11:21:43.375  INFO 1 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer    : partitions assigned:[offer-mail-crawler-new-mails-0]
web_1  | 2018-07-09 11:21:46.377  WARN 1 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : Auto-commit of offsets {offer-mail-crawler-new-mails-0=OffsetAndMetadata{offset=4, metadata=''}} failed for group offer-application-bff-local: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

Periodically the following exception appears:

2018-07-09 11:36:21.602  WARN 1 --- [ntainer#0-0-C-1] org.apache.kafka.common.protocol.Errors  : Unexpected error code: 60.
web_1  | 2018-07-09 11:36:21.603 ERROR 1 --- [ntainer#0-0-C-1] essageListenerContainer$ListenerConsumer : Container exception
web_1  |
web_1  | org.apache.kafka.common.KafkaException: Unexpected error in join group response: The server experienced an unexpected error when processing the request
web_1  |    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:504) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:455) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:808) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:788) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:204) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:167) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:127) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:488) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:348) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:168) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:364) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:297) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043) ~[kafka-clients-0.11.0.2.jar!/:na]
web_1  |    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:628) ~[spring-kafka-1.3.5.RELEASE.jar!/:na]
web_1  |    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_151]
web_1  |    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_151]

and periodically this one:

Failed to send SSL Close message 

java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_162]
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_162]
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_162]
    at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_162]
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_162]
    at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:194) ~[kafka-clients-0.11.0.2.jar:na]

With a single consumer it works like a charm: no warnings, nothing. Does anyone have a clue what is going wrong here?

2 Answers


Eventually, I found out what the problem was. As you can see in the code, I didn't specify the client.id property for the Kafka consumer. That turned out to be crucial for spring-kafka, because it used the same auto-generated client.id = consumer-0 for both consumers inside the consumer group, which resulted in infinite rebalancing of partitions between the two consumers with the same name. I had to set it to a partially random string, ConsumerConfig.CLIENT_ID_CONFIG to "bff-${UUID.randomUUID()}", to get it working:

    @Bean
    fun consumerConfigs(): Map<String, Any> {
        val connectionString = "Endpoint=sb://${eventHubsBroker}/;" +
                "SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=${newMailsEventHubSharedKey}"

        return mapOf(
                ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG to "${eventHubsBroker}:9093",
                ConsumerConfig.CLIENT_ID_CONFIG to "bff-${UUID.randomUUID()}",
                ConsumerConfig.GROUP_ID_CONFIG to consumerGroup,
                CommonClientConfigs.SECURITY_PROTOCOL_CONFIG to "SASL_SSL",
                SaslConfigs.SASL_MECHANISM to "PLAIN",
                SaslConfigs.SASL_JAAS_CONFIG to "org.apache.kafka.common.security.plain.PlainLoginModule required " +
                        "username=\"\$ConnectionString\" password=\"$connectionString\";",
                ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG to IntegerDeserializer::class.java,
                ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG to JsonDeserializer::class.java
        )
    }
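
The same effect can also come purely from configuration: a minimal sketch, assuming Spring Boot's random-value property source (${random.uuid}) is available and using a hypothetical eventhubs.client-id property.

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.springframework.beans.factory.annotation.Value
import org.springframework.context.annotation.Configuration

// application.yml (hypothetical property name):
//   eventhubs:
//     client-id: bff-${random.uuid}   # resolved by Spring Boot's RandomValuePropertySource
@Configuration
class ClientIdConfig(@Value("\${eventhubs.client-id}") private val clientId: String) {

    // This pair would replace the hard-coded "bff-${UUID.randomUUID()}" entry in consumerConfigs().
    fun clientIdEntry(): Pair<String, Any> = ConsumerConfig.CLIENT_ID_CONFIG to clientId
}

Each injection point resolves ${random.uuid} to its own value, so every application instance still ends up with a distinct client.id.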
Roman T
  • Interesting. Do you know why the consumers used the same auto-generated client.id? – kkflf Jul 10 '18 at 11:20
  • I guess it's the default behavior of spring-kafka when no client.id is specified, but I didn't investigate it. Have you used spring-kafka for such consuming purposes, or just kafka-clients or spring-cloud-stream? – Roman T Jul 10 '18 at 11:42
  • I work with Spring-Kafka daily and I am in the process of learning all the details. I have previously used multiple consumers in a single application, and I did not specify a client-id, nor did I get any errors based on the client-id. I have only had trouble when the number of consumers per group-id exceeded the number of partitions. I assume the client-id handling has changed over time in Spring-Kafka. The documentation states that the client-id is a prefix and will be suffixed by `-n` to ensure unique clients. It might be different in 1.3.5 - https://docs.spring.io/spring-kafka/reference/htmlsingle/#_client_id – kkflf Jul 10 '18 at 11:50
  • I'm learning too. That's interesting; actually, I debugged the solution with a brand-new Kotlin project on the newest version of spring-kafka, 2.1.x, because it has better logging of the `client.id`s in use than 1.3.5. I shared this project with you so that you can take a look at exactly that if you have time. https://gitlab.com/oklaska/event-hubs-kafka-consumer – Roman T Jul 10 '18 at 12:16
  • Yeah, thanks for looking at it. In the real world it's mostly the case that you have grouped consumers running as separate JVMs (containers). Concurrency inside a JVM consumer is also a good thing and works, as you said, without the issue. – Roman T Jul 10 '18 at 13:20
  • I thought you were running two consumers in the same application. After cloning the repo, I realized you are running one consumer, but the same application twice. I was 100% sure that the client-id had to be unique. I have just done some testing with Spring-Kafka 2.1.7, running multiple applications with an identical client-id. This worked without any error, so apparently it is possible to use an identical client-id in multiple applications. I deleted my previous two comments; they were misleading. – kkflf Jul 10 '18 at 13:23
  • Maybe it's an Event Hubs issue in this case. – Roman T Jul 10 '18 at 13:24
  • The Kafka documentation states that the client-id has to be unique, so that you can track information such as requests per client. Event Hubs might have something which prevents you from running multiple applications with the same client-id. – kkflf Jul 10 '18 at 13:27
  • I just ran your code without the SASL_SSL configuration, against a localhost Kafka. I used a topic with 3 partitions and ran your application in two instances. Both instances shared the same group-id `mygrp` and the default client-id `consumer-1`. It ran without any problems. The application A consumer was automatically assigned two partitions, and application B was automatically assigned one partition. It is therefore either an Event Hubs or SASL_SSL config error. I suggest you set `logging.level.root: DEBUG` in your application.yaml and try some local testing with https://github.com/Landoop/fast-data-dev – kkflf Jul 10 '18 at 13:37

You cannot have more consumers using an identical group-id than the number of partitions for a given topic.

E.g. a topic with 3 partitions can have 1-3 consumers using the same group-id.

I assume your topic only has one partition and the two consumers keep fighting over this resource. You will have to either remove one of your consumers or add an additional partition to your topic.
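
If the topic lives on a self-managed Kafka cluster, the partition count can also be raised programmatically. A minimal sketch, assuming kafka-clients 1.0+ (which added AdminClient.createPartitions), a hypothetical localhost bootstrap address, and the topic name from the question's logs; on Event Hubs the partition count is normally fixed when the event hub is created in the Azure portal:

import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.admin.AdminClientConfig
import org.apache.kafka.clients.admin.NewPartitions

fun main() {
    // Hypothetical bootstrap address for a self-managed cluster.
    val admin = AdminClient.create(mapOf<String, Any>(
            AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG to "localhost:9092"))
    try {
        // Grow the topic to 3 partitions (the count can only be increased, never decreased).
        admin.createPartitions(mapOf("offer-mail-crawler-new-mails" to NewPartitions.increaseTo(3)))
                .all()
                .get()
    } finally {
        admin.close()
    }
}

The kafka-topics command line tool (--alter --partitions) achieves the same thing.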

kkflf
  • Yes, that's the case, I have one partition in the topic. OK, I'm starting to get it. So if I want to scale my consumers in the group, I necessarily need to reconfigure the number of partitions for the topic on the Kafka server. Is that convenient? – Roman T Jul 09 '18 at 12:43
  • Kafka will balance the assignment such that each consumer gets at least one partition, as long as there are enough partitions. E.g. 5 partitions and 3 consumers will result in 1 consumer being assigned 1 partition and 2 consumers being assigned 2 partitions each. But if you have 3 partitions and 5 consumers, then you will have 2 consumers without any partitions. – kkflf Jul 09 '18 at 12:51
  • Useful information: https://dzone.com/articles/dont-use-apache-kafka-consumer-groups-the-wrong-wa – kkflf Jul 09 '18 at 12:54
  • You will have to pick the "right" number of partitions for your use case; there is no single correct answer. You can also process your messages as a batch if you want better performance in your application. – kkflf Jul 09 '18 at 12:58
  • I get the same warnings with 5 partitions. The 2 consumers seem to fight again over some partitions, with lots of Broken Pipe exceptions. Maybe it's an issue with Spring Kafka? – Roman T Jul 09 '18 at 13:37
  • What development environment are you using? Can you double check the number of partitions? I have double checked your log and it looks very much like too many consumers compared to partitions. How did you increase the number of partitions? Can you try to change your group-id? – kkflf Jul 09 '18 at 14:07
  • Yeah, I double-checked it. I created the new topic through the Azure portal console: https://hmp.me/brmk – Roman T Jul 09 '18 at 14:15
  • Sorry, I do not have a solution for you. I will think about your question and I'll return if I figure something out. – kkflf Jul 09 '18 at 14:16
  • I don't know if it's relevant, but [Richard Seroter blogged about his experiences](https://seroter.wordpress.com/2018/05/29/how-to-use-the-kafka-interface-of-azure-event-hubs-with-spring-cloud-stream/) getting Spring Cloud Stream (SCSt) to work with Event Hubs over Kafka (SCSt uses spring-kafka). He found some anomalies with group management, perhaps because (at least at the time) Azure's Kafka implementation was in beta. – Gary Russell Jul 09 '18 at 19:45
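
As a side note to the partition/consumer discussion above: inside a single application instance, the listener concurrency can be matched to the partition count on the container factory from the question. A minimal sketch, assuming a topic with 3 partitions:

@Bean
fun kafkaListenerContainerFactory(consumerFactory: ConsumerFactory<Int, NewMailEvent>):
        ConcurrentKafkaListenerContainerFactory<Int, NewMailEvent> {
    val factory = ConcurrentKafkaListenerContainerFactory<Int, NewMailEvent>()
    factory.consumerFactory = consumerFactory
    // One listener thread per partition; threads beyond the partition count would sit idle.
    factory.setConcurrency(3)
    return factory
}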