I have an application that uses Kafka with two separate consumer groups listening to one topic. One consumer group (C1) is always listening for messages; the other (C2) comes online, listens for messages, and then goes offline again for some time.
More specifically, the code that is always listening on consumer group C1 responds to a message by creating a virtual machine that starts listening on C2 and does some work using costly hardware.
The problem I'm running into is that after the virtual machine is spun up and starts listening on consumer group C2, it will sometimes receive nothing, even though it should receive the same message that C1 received, the one that caused C2 to start listening in the first place.
I'm using the following topic, producer, and consumer configs:
topic config:
partitions: 6
compression.type: producer
leader.replication.throttled.replicas: --
message.downconversion.enable: true
min.insync.replicas: 2
segment.jitter.ms: 0
cleanup.policy: delete
flush.ms: 9223372036854775807
follower.replication.throttled.replicas: --
segment.bytes: 104857600
retention.ms: 604800000
flush.messages: 9223372036854775807
message.format.version: 3.0-IV1
max.compaction.lag.ms: 9223372036854775807
file.delete.delay.ms: 60000
max.message.bytes: 8388608
min.compaction.lag.ms: 0
message.timestamp.type: CreateTime
preallocate: false
min.cleanable.dirty.ratio: 0.5
index.interval.bytes: 4096
unclean.leader.election.enable: false
retention.bytes: -1
delete.retention.ms: 86400000
segment.ms: 604800000
message.timestamp.difference.max.ms: 9223372036854775807
segment.index.bytes: 10485760
producer config:
("message.max.bytes", "20971520")
("queue.buffering.max.ms", "0")
consumer config:
("enable.partition.eof", "false")
("session.timeout.ms", "6000")
("enable.auto.commit", "true")
("auto.commit.interval.ms", "5000")
("enable.auto.offset.store", "true")
The bug is intermittent: sometimes it occurs and sometimes it doesn't. Resending the exact same message after the consumer is up and listening on C2 always succeeds, so it isn't an issue like the message being too large for the topic.
I suspect it's related to offsets being committed or stored improperly. My consumer uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets for C2 are getting dropped or not committed somehow, and the message that triggered C2's listening is then missed because, by Kafka's accounting, it was already behind the "latest" position when C2 joined. The work done by the code listening on consumer group C2 is quite long-running and the consumer often reports a timeout, so maybe that's contributing?
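To make that suspicion concrete, here is a toy model in plain Rust of how I understand the starting position to be resolved for a single partition. It is not rdkafka code, and the names OffsetReset, PartitionLog, starting_offset and will_receive are made up for illustration. It assumes that a valid committed offset always wins and that the reset policy only applies when there is no committed offset or the committed one is out of range.

```rust
/// Toy model of how a consumer group's starting position is resolved
/// for one partition. Hypothetical names; this is not the rdkafka API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OffsetReset {
    Earliest,
    Latest,
}

/// Offsets currently retained in one partition: [log_start, log_end).
/// `log_end` is the offset the next produced message will receive.
#[derive(Debug, Clone, Copy)]
pub struct PartitionLog {
    pub log_start: u64,
    pub log_end: u64,
}

/// Returns the offset of the first message the group will read.
pub fn starting_offset(committed: Option<u64>, log: PartitionLog, reset: OffsetReset) -> u64 {
    match committed {
        // A committed offset inside the retained range wins; the policy is ignored.
        Some(offset) if offset >= log.log_start && offset <= log.log_end => offset,
        // No committed offset, or one that is out of range: fall back to the policy.
        _ => match reset {
            OffsetReset::Earliest => log.log_start,
            OffsetReset::Latest => log.log_end,
        },
    }
}

/// True if the message at `offset` is delivered given the starting position.
pub fn will_receive(offset: u64, start: u64) -> bool {
    offset >= start
}
```

For example, with a log ending at offset 42 and the trigger message sitting at offset 41, starting_offset(None, log, OffsetReset::Latest) gives 42 and the message is skipped, while a committed offset of 41 or a policy of Earliest would deliver it. That matches what I see if C2 sometimes has no usable committed offset for the partition the message landed on.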
EDIT: The timeout error I get is exactly:
WARN - librdkafka - librdkafka: MAXPOLL [thrd:main]: Application maximum poll interval (300000ms) exceeded by 424ms (adjust max.poll.interval.ms for long-running message processing): leaving group
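As I read that warning, the consumer is treated as failed and leaves the group once the gap between two polls exceeds max.poll.interval.ms. A minimal sketch of that check in plain Rust, where poll_overrun is a hypothetical helper of mine and not part of rdkafka or librdkafka:

```rust
use std::time::Duration;

/// Returns how far the time since the last poll exceeds the allowed
/// interval, or None if the consumer is still within the limit.
/// Hypothetical helper for illustration only.
pub fn poll_overrun(since_last_poll: Duration, max_poll_interval: Duration) -> Option<Duration> {
    since_last_poll
        // None when since_last_poll is shorter than the limit.
        .checked_sub(max_poll_interval)
        // Exactly at the limit does not count as an overrun.
        .filter(|overrun| !overrun.is_zero())
}
```

With the numbers from the log, a gap of 300424 ms against a limit of 300000 ms yields an overrun of 424 ms, whereas a job that polls again after 10 seconds yields None. Since my processing regularly takes longer than five minutes without polling, C2 probably leaves the group in the middle of its work.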
I am using the Rust rdkafka library for both the producer and consumer, with Confluent Cloud's hosted Kafka.