Kafka Consumer/Producer: is it safe to produce in the callback of a production? (max.poll.interval.ms error)

Question

TL;DR: Is there any limitation on re-producing to a topic from within the delivery callback of a previous produce call? The consumer hangs with:

Application maximum poll interval (300000ms) exceeded by 72ms (adjust max.poll.interval.ms for long-running message processing): leaving group

I get an error in a consumer-producer process, using Python 3.9 and the confluent_kafka library (Producer/Consumer classes).

Process uses Consumer to pull data from topic.

It processes the events and uses the Producer to produce to a separate topic.

If producing fails because the topic does not exist, the callback uses the AdminClient to create it (the target topic depends on the type of message, so a new topic is created the first time a given type arrives).

Both the creation of the topic and the re-production to it happen in the callback function.

When the topic already exists, the application works normally. However, when the topic does not exist yet:

2023-05-17T13:31:07.976+02:00   Attempted to produce message to non-existent topic: xxx
2023-05-17T13:31:07.976+02:00   Creating new topic xxx and reproducing message
2023-05-17T13:31:08.747+02:00   Topic "xxx" created
2023-05-17T13:31:09.901+02:00   Produced message on topic xxx
2023-05-17T13:35:35.285+02:00   %4|1684323335.285|MAXPOLL|process#consumer-1| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 72ms (adjust max.poll.interval.ms for long-running message processing): leaving group

In this situation, the default max.poll.interval.ms looks more than sufficient, especially since the topic is created and the message is produced quickly. It seems that between the production of the message (13:31:09) and the timeout (13:35:35) the process sits idle for roughly four and a half minutes, so it apparently never comes back?

Consumer commits happen after the production has finished (to get at-least-once delivery).

Could it be because of an accumulation/conflict of callbacks? Any other idea?

Answer

The limitation is the consumer's poll timeout. Creating a topic is a blocking operation and may itself time out; while it blocks, the consumer does not call poll(), so it exceeds max.poll.interval.ms and leaves the consumer group.
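One way to keep blocking work out of the delivery callback is to have the callback only record the failure and let the main loop do the topic creation and the retry. The following is a minimal sketch under assumptions, not the asker's code: `DeferredRetries`, `is_missing_topic` and `ensure_topic` are hypothetical names, and the producer is assumed to be a confluent_kafka `Producer` (only `produce()` with `on_delivery` is used).

```python
from collections import deque


class DeferredRetries:
    """Collects failed deliveries so the main loop can retry them later."""

    def __init__(self, is_missing_topic):
        # is_missing_topic(err) -> bool is a hypothetical predicate; with
        # confluent_kafka it might compare err.code() against
        # KafkaError._UNKNOWN_TOPIC or KafkaError.UNKNOWN_TOPIC_OR_PART.
        self._is_missing_topic = is_missing_topic
        self.pending = deque()

    def on_delivery(self, err, msg):
        # Delivery callbacks are served from producer.poll()/flush(), so this
        # only records the failure and returns immediately.
        if err is not None and self._is_missing_topic(err):
            self.pending.append((msg.topic(), msg.key(), msg.value()))

    def drain(self, producer, ensure_topic):
        # Called from the main loop, outside any callback. ensure_topic(name)
        # is a hypothetical helper that creates the topic if needed.
        retried = 0
        while self.pending:
            topic, key, value = self.pending.popleft()
            ensure_topic(topic)
            producer.produce(topic, key=key, value=value,
                             on_delivery=self.on_delivery)
            retried += 1
        return retried
```

The main loop would pass `retries.on_delivery` to `produce()` and call `retries.drain(...)` between consumer polls, so the callback itself never blocks.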

Similarly, don't block the producer by calling flush, for example.

You can work around these issues by increasing max poll interval, or pausing the consumer while the topic is created.
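The pausing workaround could look roughly like the sketch below. It assumes a confluent_kafka `Consumer`, an `AdminClient` and a `NewTopic` are passed in, and that it is called from the main loop rather than from a delivery callback; the function name, `poll_timeout` and `deadline_s` are hypothetical. The idea is to pause the assigned partitions and keep calling `poll()` while the creation future is pending, so the poll interval is never exceeded.

```python
import time


def create_topic_while_paused(consumer, admin, new_topic,
                              poll_timeout=0.5, deadline_s=60.0):
    """Create a topic while the consumer keeps polling on paused partitions."""
    partitions = consumer.assignment()
    consumer.pause(partitions)
    stray = []  # anything poll() still hands back while paused
    try:
        # create_topics() returns a dict of futures keyed by topic name.
        (future,) = admin.create_topics([new_topic]).values()
        give_up_at = time.monotonic() + deadline_s
        while not future.done():
            if time.monotonic() > give_up_at:
                raise TimeoutError("topic creation did not finish in time")
            # Calling poll() keeps the consumer within max.poll.interval.ms.
            event = consumer.poll(poll_timeout)
            if event is not None:
                stray.append(event)
        future.result()  # re-raises the creation error, if any
    finally:
        # Assumes the assignment did not change while paused.
        consumer.resume(partitions)
    return stray
```

The returned list is normally empty; it is there so that an error event or message delivered during the wait is not silently dropped.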

answered May 17 '23 at 13:48

OneCricketeer

Thanks! I see. I'll increase `max.poll.interval.ms` to cover for topic creation. About the topic creation blocking: I'll add the `operation_timeout` setting to leave some time for the adminClient to perform the operation to have more clarity. About producer: yes I call flush, since I want the consumer to commit after the production is done to ensure at-least-once. Is there a way to make it work differently? Is the producer also sensitive to that blocking? (I haven't found a similar timeout setting for Producer) – xmar May 17 '23 at 14:48
Increased `max.poll.interval.ms` to 900000 (15min, 3x default), to no avail. Same behaviour. Seems like it does not come back from the topic creation, even if the topic gets created pretty quickly, and it gets produced to. – xmar May 17 '23 at 17:41
For the producer, you could flush every ten records, for example, and/or flush when the consumer reaches the end of a partition, or after some interval of not reading any data... I was saying that you don't need to flush for every record consumed. The AdminClient should be relatively quick, in my experience, less than minutes for sure. I have seen it take longer when authentication is required, however, at least when using `kafka-topics.sh` – OneCricketeer May 17 '23 at 17:51
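
The batching suggested in the last comment might be sketched as follows, keeping at-least-once by committing only after a successful flush. This is an illustration under assumptions: the consumer is a confluent_kafka `Consumer` with `enable.auto.commit` set to false, the producer is a confluent_kafka `Producer`, and `relay_in_batches`, `out_topic` and `max_polls` are hypothetical names (`max_polls` only bounds the sketch; a service would loop until shutdown).

```python
def relay_in_batches(consumer, producer, out_topic, max_polls,
                     batch_size=10, poll_timeout=1.0, flush_timeout=30.0):
    """Consume, re-produce, and commit only after a successful flush.

    Returns the number of commits made.
    """
    failures = []

    def on_delivery(err, msg):
        # Only record the error; no blocking work inside the callback.
        if err is not None:
            failures.append(err)

    def flush_and_commit():
        still_queued = producer.flush(flush_timeout)
        if still_queued or failures:
            raise RuntimeError("delivery incomplete; offsets not committed")
        consumer.commit(asynchronous=False)

    pending = commits = 0
    for _ in range(max_polls):
        msg = consumer.poll(poll_timeout)
        # Error events are skipped here; real code would log or handle them.
        if msg is not None and msg.error() is None:
            producer.produce(out_topic, key=msg.key(), value=msg.value(),
                             on_delivery=on_delivery)
            producer.poll(0)  # serve delivery callbacks without blocking
            pending += 1
        # Flush on a full batch, or when the consumer is idle with work pending.
        if pending >= batch_size or (msg is None and pending):
            flush_and_commit()
            pending = 0
            commits += 1
    if pending:
        flush_and_commit()
        commits += 1
    return commits
```

With a batch size of ten, the blocking `flush()` runs once per ten records (or once when the input goes quiet) instead of once per record, which leaves far more of the poll interval for actual processing.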
