
I'm running Kafka on Docker (note: with the wurstmeister image). I have three consumer groups (each group contains 3 consumers).

Two groups run without problems, but the last one added behaves badly (many rebalances in the Kafka logs). The consumer code is the same as for group1 (the KafkaConsumer code was copy/pasted).
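
The consume loop itself is not shown here; roughly, each consumer just iterates over the KafkaConsumer built with the configuration in the EDIT below (an illustrative sketch only, process_message stands in for the real work):

from kafka import KafkaConsumer

def process_message(value):
    # Stand-in for the real per-message processing done by each consumer.
    print(value)

def consume_loop(consumer: KafkaConsumer):
    # Iterating the consumer polls the broker internally, so time spent in
    # process_message() delays the next poll.
    for message in consumer:
        process_message(message.value)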

I added logging to the 3 consumers of group3, but I can't determine what the problem is:

2020-08-10 10:00:35 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Request 191: HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:35 DEBUG    Received correlation id: 191
2020-08-10 10:00:35 DEBUG    Processing response HeartbeatResponse_v1
2020-08-10 10:00:35 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Response 191 (101.25017166137695 ms): HeartbeatResponse_v1(throttle_time_ms=0, error_code=0)
2020-08-10 10:00:35 DEBUG    Received successful heartbeat response for group group3
2020-08-10 10:00:37 DEBUG    Heartbeat: group3[2] kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547
2020-08-10 10:00:37 DEBUG    Sending request HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:37 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Request 192: HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:37 DEBUG    Received correlation id: 192
2020-08-10 10:00:37 DEBUG    Processing response HeartbeatResponse_v1
2020-08-10 10:00:37 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Response 192 (101.10831260681152 ms): HeartbeatResponse_v1(throttle_time_ms=0, error_code=0)
2020-08-10 10:00:37 DEBUG    Received successful heartbeat response for group group3
2020-08-10 10:00:39 DEBUG    Heartbeat: group3[2] kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547
2020-08-10 10:00:39 DEBUG    Sending request HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:39 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Request 193: HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:40 DEBUG    Received correlation id: 193
2020-08-10 10:00:40 DEBUG    Processing response HeartbeatResponse_v1
2020-08-10 10:00:40 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Response 193 (101.21679306030273 ms): HeartbeatResponse_v1(throttle_time_ms=0, error_code=0)
2020-08-10 10:00:40 DEBUG    Received successful heartbeat response for group group3
2020-08-10 10:00:41 DEBUG    Heartbeat: group3[2] kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547
2020-08-10 10:00:42 DEBUG    Sending request HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:42 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Request 194: HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:42 DEBUG    Received correlation id: 194
2020-08-10 10:00:42 DEBUG    Processing response HeartbeatResponse_v1
2020-08-10 10:00:42 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Response 194 (103.01470756530762 ms): HeartbeatResponse_v1(throttle_time_ms=0, error_code=0)
2020-08-10 10:00:42 DEBUG    Received successful heartbeat response for group group3
2020-08-10 10:00:43 DEBUG    Heartbeat: group3[2] kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547
2020-08-10 10:00:43 DEBUG    Sending request HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:43 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Request 195: HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:44 DEBUG    Received correlation id: 195
2020-08-10 10:00:44 DEBUG    Processing response HeartbeatResponse_v1
2020-08-10 10:00:44 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Response 195 (101.14407539367676 ms): HeartbeatResponse_v1(throttle_time_ms=0, error_code=0)
2020-08-10 10:00:44 DEBUG    Received successful heartbeat response for group group3
2020-08-10 10:00:45 DEBUG    Heartbeat: group3[2] kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547
2020-08-10 10:00:45 DEBUG    Sending request HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:45 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Request 196: HeartbeatRequest_v1(group='group3', generation_id=2, member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:46 DEBUG    Received correlation id: 196
2020-08-10 10:00:46 DEBUG    Processing response HeartbeatResponse_v1
2020-08-10 10:00:46 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Response 196 (100.91781616210938 ms): HeartbeatResponse_v1(throttle_time_ms=0, error_code=0)
2020-08-10 10:00:46 DEBUG    Received successful heartbeat response for group group3
2020-08-10 10:00:47 WARNING  Heartbeat poll expired, leaving group
2020-08-10 10:00:47 INFO     Leaving consumer group (group3).
2020-08-10 10:00:47 DEBUG    Sending request LeaveGroupRequest_v1(group='group3', member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:47 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Request 197: LeaveGroupRequest_v1(group='group3', member_id='kafka-python-2.0.1-8828091f-190e-49bb-88e3-1e702b221547')
2020-08-10 10:00:47 DEBUG    Received correlation id: 197
2020-08-10 10:00:47 DEBUG    Processing response LeaveGroupResponse_v1
2020-08-10 10:00:47 DEBUG    <BrokerConnection node_id=coordinator-1001 host=kafka:9092 <connected> [IPv4 ('172.18.0.6', 9092)]> Response 197 (11.553049087524414 ms): LeaveGroupResponse_v1(throttle_time_ms=0, error_code=0)
2020-08-10 10:00:47 DEBUG    LeaveGroup request for group group3 returned successfully
2020-08-10 10:00:47 DEBUG    Group state is not stable, disabling heartbeats
2020-08-10 10:00:47 DEBUG    Heartbeat disabled. Waiting

Does anyone have an idea? Thank you!

EDIT: consumer configuration

from json import loads          # assuming the deserializers use json.loads
from kafka import KafkaConsumer

# inside the consumer class
self.consumer = KafkaConsumer(outTopic,
                              bootstrap_servers=brokers,
                              auto_offset_reset='earliest',
                              enable_auto_commit=True,
                              session_timeout_ms=14000,    # member evicted if no heartbeat within 14 s
                              heartbeat_interval_ms=2000,  # heartbeat sent every 2 s
                              group_id='group3',
                              key_deserializer=lambda x: loads(x.decode('utf-8')),
                              value_deserializer=lambda x: loads(x.decode('utf-8')))
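
For reference, the same constructor also accepts max_poll_interval_ms and max_poll_records in kafka-python; an illustrative variant with those set explicitly (the values below are examples, not what is deployed):

from json import loads
from kafka import KafkaConsumer

# Illustrative variant only: max_poll_interval_ms bounds the time allowed
# between two internal poll() calls, max_poll_records bounds how many records
# a single poll() returns (kafka-python defaults: 300000 ms and 500).
consumer = KafkaConsumer(outTopic,
                         bootstrap_servers=brokers,
                         group_id='group3',
                         auto_offset_reset='earliest',
                         enable_auto_commit=True,
                         session_timeout_ms=14000,
                         heartbeat_interval_ms=2000,
                         max_poll_interval_ms=600000,  # example value
                         max_poll_records=100,         # example value
                         key_deserializer=lambda x: loads(x.decode('utf-8')),
                         value_deserializer=lambda x: loads(x.decode('utf-8')))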

EDIT2: WARNING lines from the logs (filtered) => the heartbeat fails every 2 s while the group is rebalancing, and roughly 5 minutes later the poll expires ("Heartbeat poll expired, leaving group")

2020-08-11 08:05:02 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:05:04 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:05:06 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:05:08 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:05:10 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:05:12 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:05:14 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:10:15 WARNING  Heartbeat poll expired, leaving group
2020-08-11 08:12:45 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:12:47 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:12:49 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:12:51 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:12:53 WARNING  Heartbeat failed for group group3 because it is rebalancing
2020-08-11 08:12:55 WARNING  Heartbeat failed for group group3 because it is rebalancing

EDIT3: rebalancing on the Kafka container side (I think due to the poll failing):

[2020-08-12 12:41:50,514] INFO [GroupCoordinator 1001]: Preparing to rebalance group group3 in state PreparingRebalance with old generation 115 (__consumer_offsets-38) (reason: removing member kafka-python-2.0.1-3d0acd18-76d4-4c6c-a1c6-a5e202a3d7c2 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:41:50,547] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id kafka-python-2.0.1-f5e4662b-d1d0-45c0-93db-c0f131a84b8b] in group group3 has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:41:50,564] INFO [GroupCoordinator 1001]: Group group3 with generation 116 is now empty (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:42:40,681] INFO [GroupCoordinator 1001]: Dynamic Member with unknown member id joins group group3 in Empty state. Created a new member id kafka-python-2.0.1-a6120fb6-a528-46aa-9d6c-632a6343fd2c for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:42:40,683] INFO [GroupCoordinator 1001]: Preparing to rebalance group group3 in state PreparingRebalance with old generation 116 (__consumer_offsets-38) (reason: Adding new member kafka-python-2.0.1-a6120fb6-a528-46aa-9d6c-632a6343fd2c with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:42:40,686] INFO [GroupCoordinator 1001]: Stabilized group group3 generation 117 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:42:40,692] INFO [GroupCoordinator 1001]: Assignment received from leader for group group3 for generation 117 (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:42:45,578] INFO [GroupCoordinator 1001]: Dynamic Member with unknown member id joins group group3 in Stable state. Created a new member id kafka-python-2.0.1-dc5e96eb-f612-4e14-9a06-62cc39ea6485 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:42:45,579] INFO [GroupCoordinator 1001]: Preparing to rebalance group group3 in state PreparingRebalance with old generation 117 (__consumer_offsets-38) (reason: Adding new member kafka-python-2.0.1-dc5e96eb-f612-4e14-9a06-62cc39ea6485 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:47:40,547] INFO [GroupMetadataManager brokerId=1001] Removed 0 expired offsets in 3 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-08-12 12:47:45,584] INFO [GroupCoordinator 1001]: Member kafka-python-2.0.1-dc5e96eb-f612-4e14-9a06-62cc39ea6485 in group group3 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:47:45,616] INFO [GroupCoordinator 1001]: Dynamic Member with unknown member id joins group group3 in PreparingRebalance state. Created a new member id kafka-python-2.0.1-e2bbe292-9f91-4777-926a-de377e7eb543 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:50:00,039] INFO [GroupCoordinator 1001]: Stabilized group group3 generation 118 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2020-08-12 12:50:00,078] INFO [GroupCoordinator 1001]: Assignment received from leader for group group3 for generation 118 (kafka.coordinator.group.GroupCoordinator)

Note: the poll failures and rebalancing only occur after many hours of running.
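
For anyone who wants to watch the group while this happens, a small helper sketch using kafka-python's admin client can dump the group state (this is not part of my consumers; it assumes the broker is reachable at kafka:9092 from wherever the script runs):

from kafka.admin import KafkaAdminClient

# Helper sketch (separate from the consumers) to dump the current state and
# membership of group3; the broker address is an assumption about the setup.
admin = KafkaAdminClient(bootstrap_servers='kafka:9092')
print(admin.describe_consumer_groups(['group3']))
admin.close()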

  • Could you please share more details? E.g., what are the consumer configurations? – Ali Can Aug 10 '20 at 13:36
  • Edited first post with consumer conf, thank you – Chris Aug 10 '20 at 13:59
  • It seems like a bug in the kafka-python client [kafka-python/issues/1418](https://github.com/dpkp/kafka-python/issues/1418). Also, do you know how long it takes to process a single message? Your consumers are expected to finish their work before the next `.poll()` request is made. There are two configuration parameters that you can set: [max.poll.interval.ms](https://docs.confluent.io/current/installation/configuration/consumer-configs.html#max.poll.interval.ms) and [max.poll.records](https://docs.confluent.io/current/installation/configuration/consumer-configs.html#max.poll.records) – Ali Can Aug 10 '20 at 15:29
  • I'm using kafka-python 2.0.1, not 1.4.x, and I already tried reducing max.poll.records. I added WARNING-filtered logs to the first post => I get many heartbeat failures due to rebalancing. Is it possible my poll is failing for the same reason (rebalancing)? – Chris Aug 11 '20 at 08:29
  • Did you check broker logs (docker container logs) ? – Ali Can Aug 11 '20 at 08:53
  • Yes, many rebalances (probably due to the poll failing): added to the first post. Note: the poll failures and rebalancing come after many hours of running the stack. RAM seems OK... – Chris Aug 12 '20 at 12:56
  • How many partitions do you have on `outTopic` ? – Ali Can Aug 12 '20 at 14:36
  • 3 partitions in the outTopic – Chris Aug 12 '20 at 16:07
  • Did you try to run group3 with a single instance? The kafka-python documentation says "The KafkaProducer can be used across threads without issue, unlike the KafkaConsumer which cannot. While it is possible to use the KafkaConsumer in a thread-local manner, multiprocessing is recommended." Can that be the reason? – Ali Can Aug 13 '20 at 10:20
  • Each consumer is in its own Docker container (as if each consumer were on a different server). On top of that, group1 follows the same logic: 3 Docker containers consuming 3 partitions, and there is no problem with it... – Chris Aug 13 '20 at 13:04
  • I tried reducing to only 1 partition and 1 consumer => same problem... – Chris Aug 17 '20 at 08:34

0 Answers