
As I understand it, one of the brokers is selected as the group coordinator, which takes care of consumer rebalancing.

Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group

I have 3 nodes, a replication factor of 3, and 3 partitions. Everything is great: when I kill Kafka on a non-coordinator node, the consumer still receives messages.

But when I kill the specific node hosting the coordinator, no rebalance happens and my Java consumer app does not receive any messages.

2018-05-29 16:34:22.668 INFO  AbstractCoordinator:555 - Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group.
2018-05-29 16:34:22.689 INFO  AbstractCoordinator:600 - Marking the coordinator host:9092 (id: 2147483646 rack: null) dead for group good_group
2018-05-29 16:34:22.801 INFO  AbstractCoordinator:555 - Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group.
2018-05-29 16:34:22.832 INFO  AbstractCoordinator:600 - Marking the coordinator host:9092 (id: 2147483646 rack: null) dead for group good_group
2018-05-29 16:34:22.933 INFO  AbstractCoordinator:555 - Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group.
2018-05-29 16:34:23.044 WARN  ConsumerCoordinator:535 - Auto offset commit failed for group good_group: Offset commit failed with a retriable exception. You should retry committing offsets. 
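
(For reference, a minimal consumer along the lines of the sketch below reproduces this setup; the broker hostnames, topic name, and serializers are placeholders, while the group.id and the auto-commit setting match the logs above.)

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GoodGroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder hostnames; listing all three brokers lets the client
            // bootstrap even when one of them is down.
            props.put("bootstrap.servers", "host1:9092,host2:9092,host3:9092");
            props.put("group.id", "good_group");
            // Auto commit, matching the ConsumerCoordinator log line above.
            props.put("enable.auto.commit", "true");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
                while (true) {
                    // poll(long) on pre-2.0 clients; use poll(Duration) on 2.0+
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }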

Am I doing something wrong and is there a way around this?

Anton Kim
  • If you look at the logs, it almost feels like bad connectivity to the group coordinator: it keeps discovering it and then marking it dead, on repeat. – Indraneel Bende May 30 '18 at 01:49
  • https://stackoverflow.com/questions/35636739/kafka-consumer-marking-the-coordinator-2147483647-dead Maybe you can get some useful information from this. – Indraneel Bende May 30 '18 at 01:58

1 Answer


But when I kill that specific node with coordinator, rebalancing is not happening and my java consumer app does not receive any messages.

The group coordinator receives heartbeats from all consumers in the consumer group. It maintains a list of active consumers and initiates a rebalance whenever this list changes. The group leader (one of the consumers) then computes and carries out the new partition assignment.
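
If you want to see that hand-off from the client side, you can pass a ConsumerRebalanceListener when subscribing. This is only a sketch: it assumes a KafkaConsumer named consumer configured as in the question, and "my-topic" is a placeholder.

    import java.util.Collection;
    import java.util.Collections;

    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    // 'consumer' is an existing KafkaConsumer; "my-topic" is a placeholder topic name.
    consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // Called inside poll() before a rebalance, when the coordinator reclaims partitions.
            System.out.println("Revoked: " + partitions);
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            // Called inside poll() after the group leader's new assignment has been distributed.
            System.out.println("Assigned: " + partitions);
        }
    });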

That's why rebalancing stops if you kill the group coordinator.

UPDATE

If the broker acting as group coordinator shuts down, ZooKeeper is notified and an election automatically promotes a new group coordinator from the remaining active brokers. So the coordinator election itself is not the problem. Let's look at the log:

2018-05-29 16:34:23.044 WARN  ConsumerCoordinator:535 - Auto offset commit failed for group good_group: Offset commit failed with a retriable exception. You should retry committing offsets.

The internal topic __consumer_offsets probably has the default replication factor of 1. Can you check the values of default.replication.factor and offsets.topic.replication.factor in your server.properties files? If they are still at the default of 1, they should be increased. Otherwise, when the broker acting as group coordinator shuts down, its offset manager stops with no replica to take over, so offsets can no longer be committed.
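
For example, each broker's server.properties could contain something along these lines (3 is just a value matching a 3-node cluster; note that offsets.topic.replication.factor only takes effect when __consumer_offsets is first created):

    # Replication for automatically created topics
    default.replication.factor=3

    # Replication for the internal __consumer_offsets topic
    # (only applied when that topic is first created)
    offsets.topic.replication.factor=3

You can check the current state with kafka-topics.sh --describe --topic __consumer_offsets (pointing it at ZooKeeper or the brokers, depending on your Kafka version); if the topic already exists with a single replica, it has to be reassigned rather than just reconfigured.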

Quang Vien
  • Thanks. So what is the workaround? It doesn't seem resilient at all if that's the case. What's the point of having a cluster of 100 nodes if the 50th node has the coordinator and it dies? The other 50 nodes are useless as far as resiliency goes. I will have to restart my Java app again. – Anton Kim May 30 '18 at 02:25
  • I've just updated my answer above. Please check the values of default.replication.factor and offsets.topic.replication.factor; if they are 1, then change them to 3, for example. – Quang Vien May 30 '18 at 05:48
  • That topic has a replication factor of 3. – Anton Kim May 30 '18 at 05:52
  • 1
    May I know what are values of default.replication.factor and offsets.topic.replication.factor in your server.properties files ? It's not about your own topic but the internal topic named __consumer_offset generated by Kafka to manage committing offsets. – Quang Vien May 30 '18 at 05:53
  • Doesn't help; just more WARN messages are logged on the consumer side: "Connection to node 0 could not be established. Broker may not be available." – MeetJoeBlack Apr 03 '19 at 19:35
  • 1
    You might have to increase the partitions on the `__consumer_offset` logs; it looks like they are not increased automatically - I blew all my logs away and started with a clean broker and the new logs got 3 replicas and my consumer recovers ok now. – Gary Russell Jun 28 '19 at 19:21