
I've implemented a Kafka Streams app with the following properties:

application.id = KafkaStreams
application.server = 
bootstrap.servers = [localhost:9092,localhost:9093]
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 10485760
client.id = 
commit.interval.ms = 30000
connections.max.idle.ms = 540000
default.key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
default.timestamp.extractor = class org.apache.kafka.streams.processor.FailOnInvalidTimestamp
default.value.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
key.serde = null
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
num.standby.replicas = 0
num.stream.threads = 1
partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
poll.ms = 100
processing.guarantee = at_least_once
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
replication.factor = 1
request.timeout.ms = 40000
retry.backoff.ms = 100
rocksdb.config.setter = null
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
state.cleanup.delay.ms = 600000
state.dir = /tmp/kafka-streams
timestamp.extractor = null
value.serde = null
windowstore.changelog.additional.retention.ms = 86400000
zookeeper.connect = 
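
For context, here is a minimal sketch of how such an app might be wired up against the 0.11.x API. Only the property values come from the dump above; the topic names, the toUpperCase transformation, and the class name are placeholders, since the question only says the app reads from one topic, transforms, and writes to another:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class StreamsApp {

    public static void main(String[] args) {
        // Values taken from the configuration dump above.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "KafkaStreams");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092,localhost:9093");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Note: replication.factor = 1 applies to the Streams-internal
        // (changelog/repartition) topics, so those are not replicated.
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 1);

        // Placeholder topology: read, transform, write back out.
        // KStreamBuilder is the 0.11.x API (StreamsBuilder arrived in 1.0).
        KStreamBuilder builder = new KStreamBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```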

My Kafka version is 0.11.0.1. I launched two Kafka brokers, on localhost:9092 and localhost:9093 respectively. Both brokers are configured with default.replication.factor=2 and num.partitions=4 (all other configuration properties are left at their defaults).
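
For reference, that broker setup would correspond to something like the following server.properties overrides. The broker.id values and file names are assumptions (chosen to match the replica ids 0 and 1 that show up in the topic description in the comments below); the ports and the two overridden properties come from the question:

```properties
# server-1.properties (assumed name), first broker
broker.id=0
listeners=PLAINTEXT://localhost:9092
default.replication.factor=2
num.partitions=4

# server-2.properties (assumed name), second broker
broker.id=1
listeners=PLAINTEXT://localhost:9093
default.replication.factor=2
num.partitions=4
```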

My app consumes streaming data from one topic, applies some transformations, and writes the results to another topic. As soon as the second broker goes down, the app stops receiving data and prints the following:

INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Discovered coordinator localhost:9093 (id: 2147483646 rack: null) for group KafkaStreams.
[KafkaStreams-38259122-0ce7-41c3-8df6-7482626fec81-StreamThread-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Marking the coordinator localhost:9093 (id: 2147483646 rack: null) dead for group KafkaStreams
[KafkaStreams-38259122-0ce7-41c3-8df6-7482626fec81-StreamThread-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Discovered coordinator localhost:9093 (id: 2147483646 rack: null) for group KafkaStreams.
[KafkaStreams-38259122-0ce7-41c3-8df6-7482626fec81-StreamThread-1] WARN org.apache.kafka.clients.NetworkClient - Connection to node 2147483646 could not be established. Broker may not be available.
[kafka-coordinator-heartbeat-thread | KafkaStreams] WARN org.apache.kafka.clients.NetworkClient - Connection to node 1 could not be established. Broker may not be available.

For some reason it doesn't rebalance and reconnect to the first broker. Any suggestions as to why this is happening?

  • How long did you wait? Failover should happen eventually. What are your topic properties (bin/kafka-topics.sh --describe; the full command is sketched after this comment thread)? It might also be that the remaining broker cannot take over partition leadership if it's not in sync with the previous leader. – Matthias J. Sax Oct 23 '17 at 17:55
  • I waited for at least five minutes. These were my topic properties before the broker failure:
      Topic: export  Partition: 0  Leader: 1  Replicas: 1,0  Isr: 1,0
      Topic: export  Partition: 1  Leader: 0  Replicas: 0,1  Isr: 0,1
      Topic: export  Partition: 2  Leader: 1  Replicas: 1,0  Isr: 1,0
      Topic: export  Partition: 3  Leader: 0  Replicas: 0,1  Isr: 0,1
    After the failure I have the same properties, except that Leader and Isr now contain only 0. – Alexandros Mavrommatis Oct 24 '17 at 07:59
  • After searching online, I gather that the failover does take place, but the consumer tied to the failed broker doesn't recognize the leader change. – Alexandros Mavrommatis Oct 24 '17 at 08:29
  • (1) What is your topic config for min.insync.replicas? (2) After the failure, all topics now have Leader 0 and ISR 0 (just to confirm -- this is not 100% clear from your comment above)? (3) Consumers can take some time to pick up a leader change -- if they don't, please report it to dev@kafka.apache.org and share the consumer and broker logs in debug mode, so the dev team can have a look. Thx. – Matthias J. Sax Oct 24 '17 at 16:33
  • (1) min.insync.replicas is the default, 1. (2) All topics in my case means just the one topic, and yes, it changed Leader and ISR to 0. (3) This time I waited more than five minutes; it should have picked up the leader change by then. – Alexandros Mavrommatis Oct 25 '17 at 12:40
  • I agree. I am not sure why it does not pick up the new leader. Maybe you can report this to user@kafka.apache.org and also attach Streams and broker DEBUG-level log files so we can dig into it. – Matthias J. Sax Oct 25 '17 at 19:17
  • Any progress with this? – pomber Apr 03 '18 at 17:24
  • The solution is located here: https://stackoverflow.com/questions/48167430/kafka-consumer-fails-to-consume-if-first-broker-is-down – Alexandros Mavrommatis Oct 31 '18 at 16:22
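
For completeness, the describe command referenced in the comments would look roughly like this on 0.11.x, where kafka-topics.sh still talks to ZooKeeper (the ZooKeeper address is an assumption; the topic name export comes from the comments above):

```sh
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic export
```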

0 Answers