
I have an application using Kafka with two separate consumer groups listening to one topic: one consumer group (C1) is always listening for messages, while the other (C2) comes online, listens for messages for a while, and then goes offline again for some time.

More specifically, the code that is always listening on consumer group C1 responds to a message by creating a virtual machine that starts listening on C2 and does some work using costly hardware.

The problem I'm running into is that after the virtual machine is spun up and starts listening on consumer group C2, it sometimes receives nothing, even though it should be receiving the same message that C1 received (the one that caused C2 to be spun up in the first place).

I'm using the following topic, producer, and consumer configs:

topic config:

partitions: 6
compression.type: producer
leader.replication.throttled.replicas: --
message.downconversion.enable: true
min.insync.replicas: 2
segment.jitter.ms: 0
cleanup.policy: delete
flush.ms: 9223372036854775807
follower.replication.throttled.replicas: --
segment.bytes: 104857600
retention.ms: 604800000
flush.messages: 9223372036854775807
message.format.version: 3.0-IV1
max.compaction.lag.ms: 9223372036854775807
file.delete.delay.ms: 60000
max.message.bytes: 8388608
min.compaction.lag.ms: 0
message.timestamp.type: CreateTime
preallocate: false
min.cleanable.dirty.ratio: 0.5
index.interval.bytes: 4096
unclean.leader.election.enable: false
retention.bytes: -1
delete.retention.ms: 86400000
segment.ms: 604800000
message.timestamp.difference.max.ms: 9223372036854775807
segment.index.bytes: 10485760

producer config:

("message.max.bytes", "20971520")
("queue.buffering.max.ms", "0")

consumer config:

("enable.partition.eof", "false")
("session.timeout.ms", "6000")
("enable.auto.commit", "true")
("auto.commit.interval.ms", "5000")
("enable.auto.of.store", "true")

The bug is intermittent: sometimes it occurs and sometimes it doesn't, and resending the exact same message after the consumer is up and listening on C2 always succeeds, so it isn't an issue like the message size being too large for the topic or anything like that.

I suspect it's related to offsets being committed/stored improperly. My consumer configuration uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow, and thus the new message that triggered C2's listening is being missed since it isn't the "latest" by Kafka's accounting. The work done by the code listening on consumer group C2 is quite long-running, and the consumer often reports a timeout, so maybe that's contributing?

EDIT: The timeout error I get is exactly:

WARN - librdkafka - librdkafka: MAXPOLL [thrd:main]: Application maximum poll interval (300000ms) exceeded by 424ms (adjust max.poll.interval.ms for long-running message processing): leaving group

I am using the Rust rdkafka library for both the producer and consumer, with Confluent Cloud's hosted Kafka.

  • Consumer group offsets are deleted after a consumer group has no consumers for a sufficient amount of time. In older Kafka versions, the default retention is 1 day. In Kafka 2.0, the [default retention was increased to 7 days](https://cwiki.apache.org/confluence/display/KAFKA/KIP-186%3A+Increase+offsets+retention+default+to+7+days). – Chin Huang Dec 01 '22 at 19:24
  • This was actually what I initially suspected was the culprit, but I then observed this issue right after the consumer group C2 had gone through a cycle of: being spun up, handling a message and completing some work, and spinning down. Less than 1 hour after doing that it failed to receive a message after being spun up again. – Ethan Dec 01 '22 at 19:52
  • This sounds like a situation where you don't want to use `enable.auto.commit` but manually commit after your costly operation is finished. I doubt it is related to the cause, as the message that gets lost is probably sent long after C2 is spun down? Also, I think you should avoid the consumer timeouts by [pausing](https://docs.rs/rdkafka/0.29.0/rdkafka/consumer/trait.Consumer.html#tymethod.pause) the consumer or polling in a separate thread. – Caesar Dec 02 '22 at 01:55
  • What exactly does your consumer code do? Perhaps you should enqueue the event elsewhere? Then pause that consumer while that long-running process happens, and then get a notification callback (i.e. a secondary consumer waiting for completion events, that's always polling), then start the next main consumer thread, and repeat... Also, I don't think you need to modify any flush or timestamp settings to be such large numbers. Your error is explicitly about the poll interval because your consumer does "too much" / waits "too long" before polling again. – OneCricketeer Dec 02 '22 at 22:19
  • @Caesar I had been thinking about switching to manual commits, but I want to know that the problem can't be solved more easily. The message is getting lost between when C1 receives it and C2 is spun up. – Ethan Dec 03 '22 at 20:42
  • @OneCricketeer that sounds a bit complicated, is all that necessary to get what seems like a simple pattern like the one I have to work? I've increased the field "max.poll.interval.ms" to 12 hours and that **appears** to have fixed things. My theory is the consumer disconnected after the poll interval expired and never updated the offsets in the partitions, leaving them empty and subject to the "latest offset" the next time a member of C2 is spun up. Basically I think the offsets for C2 were always empty since the timeout was pretty much always hit. I'm monitoring to ensure it works now. – Ethan Dec 03 '22 at 20:44
  • It's only complicated because you're trying to do too much (blocking) work on your consumer. 12 hours is definitely not recommended. The value you had before was 300 **seconds**, and the error said you exceeded it by less than **half a second**. You could simply double it to 600 seconds, or just add 1 minute. Definitely not hours. Or, like I said, pause the consumer as soon as you get an event. Then process the data, then poll again on your own time. You should definitely commit on your own, though – OneCricketeer Dec 03 '22 at 21:05
  • @OneCricketeer Thanks for explaining! The work that is done is highly dependent on the message, it can indeed take hours to complete sometimes, hence why I jumped all the way to 12 hours. You say 12 hours is not recommended, but what are the drawbacks of such a pattern? If simply bumping the max poll interval to 12 hours can fix my issue without weird future bugs I'd prefer that as opposed to pausing and handling manual commits and handling the potential errors that come with them. – Ethan Dec 05 '22 at 17:59
  • The drawback is that a consumer crashing within a few minutes is less likely than one crashing over a span of hours. Kafka is best used for processes on the order of several minutes at most, rather than request-response, blocking, or hour/day-long orchestration. That's not to say you cannot produce new events when long processes finish, but waiting for a long process to finish before polling new records is a poor design (from what I've experienced) – OneCricketeer Dec 05 '22 at 19:23
  • @OneCricketeer is there an answer you think should be posted here? or is there an existing answer that explains the behavior? I'm doing a bit of SO cleanup but I'm a Rust SME and this problem doesn't sound Rust-specific. – kmdreko Jan 07 '23 at 20:55
  • @kmdreko configuration is specific to the rdkafka rust crate – OneCricketeer Jan 08 '23 at 15:34
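
Putting together the pattern suggested in the comments above (manual commits after the costly work, plus pausing the consumer and continuing to poll while the work runs), a rough sketch using rdkafka's `BaseConsumer` might look like this; the broker address, group id, topic name, and `do_costly_work` function are placeholder assumptions, not code from the question:

```rust
// Rough sketch of the pause-and-manually-commit pattern discussed above.
// Broker address, group id, topic name, and do_costly_work are placeholders.
use std::thread;
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, CommitMode, Consumer};
use rdkafka::Message;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "BROKER:9092") // placeholder
        .set("group.id", "C2")                   // placeholder
        .set("enable.auto.commit", "false")      // commit manually once the work succeeds
        .create()?;

    consumer.subscribe(&["work-topic"])?; // placeholder topic name

    loop {
        // Poll frequently so the consumer never exceeds max.poll.interval.ms.
        let msg = match consumer.poll(Duration::from_millis(500)) {
            Some(Ok(m)) => m,
            Some(Err(e)) => {
                eprintln!("kafka error: {e}");
                continue;
            }
            None => continue,
        };
        let payload = msg.payload().map(|p| p.to_vec()).unwrap_or_default();

        // Pause all assigned partitions so no further messages are fetched
        // while the long-running job runs on another thread.
        let assignment = consumer.assignment()?;
        consumer.pause(&assignment)?;

        let work = thread::spawn(move || do_costly_work(payload));

        // Keep polling (it returns nothing while paused) so the consumer
        // stays in the group instead of hitting the MAXPOLL error.
        while !work.is_finished() {
            let _ = consumer.poll(Duration::from_millis(500));
        }
        work.join().expect("worker thread panicked");

        // Only now record progress, then resume fetching new messages.
        consumer.commit_message(&msg, CommitMode::Sync)?;
        consumer.resume(&assignment)?;
    }
}

fn do_costly_work(_payload: Vec<u8>) {
    // Stand-in for the hours-long job on the expensive hardware.
    thread::sleep(Duration::from_secs(10));
}
```

Because the offset is committed only after the work completes, a crash mid-job means the message is redelivered on the next run rather than silently skipped.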

1 Answer


uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow

That has nothing to do with committed offsets. It only controls where a consumer starts reading when its group id has no committed offsets yet.


You have auto commits enabled, but you're getting errors. Therefore offsets are being committed even though you're not successfully processing the data, and that's why messages appear to be skipped.

Your error,

maximum poll interval (300000ms) exceeded by 424ms

Without seeing your consumer code, I can only say you'll need to do "slightly less" work within your poll loop. For example, removing a log line could easily save half a second, assuming each log statement takes 1 ms and you're polling 500 records each time.

Otherwise, increasing max.poll.interval.ms (allow more time between polls before the consumer is considered failed) or reducing max.poll.records (process less data per poll, but poll more frequently) is the correct response to this error.
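
For the rdkafka consumer in the question, that first knob is just another `ClientConfig` entry. A minimal sketch (broker address, group id, and the 600000 ms value are illustrative assumptions, not a recommendation for every workload):

```rust
// Minimal sketch of raising max.poll.interval.ms on the C2 consumer.
// Broker address and group id are placeholders; 600000 ms is simply
// double the 300000 ms default shown in the error message.
use rdkafka::config::ClientConfig;
use rdkafka::consumer::BaseConsumer;
use rdkafka::error::KafkaError;

fn build_c2_consumer() -> Result<BaseConsumer, KafkaError> {
    ClientConfig::new()
        .set("bootstrap.servers", "BROKER:9092") // placeholder
        .set("group.id", "C2")                   // placeholder
        .set("max.poll.interval.ms", "600000")   // raised from the 300000 ms default
        .create()
}
```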
