Python: confluent-kafka (librdkafka) producer Timeout error

Question

I've got a 9-node Kafka cluster hosted on AWS MSK, and I'm using the confluent-kafka library with Python.

While producing to a topic, I get a lot of timeout errors like these:

%5|1684850550.061|REQTMOUT|eb3004f09ce7#producer-1| [thrd:sasl_ssl://broker1.a]: sasl_ssl://broker1.amazonaws.com:9096/3: Timed out ProduceRequest in flight (after 1391ms, timeout #0): possibly held back by preceeding ProduceRequest with timeout in 58184ms

%3|1684850430.948|FAIL|f814d85051d8#producer-1| [thrd:sasl_ssl://broker1.a]: sasl_ssl://broker1.amazonaws.com:9096/3: 1 request(s) timed out: disconnect (after 4493ms in state UP)

%4|1684858183.537|REQTMOUT|6521878ea69c#producer-1| [thrd:sasl_ssl://broker1.a]: sasl_ssl://broker1.amazonaws.com:9096/3: Timed out 6 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests

My producer config is:

    import socket

    producer_config = {
        # (connection and SASL settings not shown)
        "batch.size": 5_242_880,
        "client.id": socket.gethostname(),
        "compression.codec": "gzip",
        "linger.ms": 40,
        "message.max.bytes": 5_242_880,
        "batch.num.messages": 1_000_000,
        "queue.buffering.max.messages": 10_000_000,
    }

My broker config is:

    auto.create.topics.enable = false
    group.initial.rebalance.delay.ms = 3
    log.retention.ms = 300000
    log.segment.bytes = 1073741824
    message.max.bytes = 10485760
    min.insync.replicas = 2
    num.io.threads = 32
    num.network.threads = 1500
    num.recovery.threads.per.data.dir = 8
    num.replica.fetchers = 2
    offsets.retention.minutes = 180
    offsets.topic.replication.factor = 3
    replica.fetch.max.bytes = 10485760
    replica.fetch.response.max.bytes = 10485760
    replica.socket.receive.buffer.bytes = 10485760
    socket.receive.buffer.bytes = 10485760
    socket.request.max.bytes = 104857600
    socket.send.buffer.bytes = 10485760
    transaction.state.log.min.isr = 1
    transaction.state.log.replication.factor = 3
    unclean.leader.election.enable = true
    zookeeper.connection.timeout.ms = 300000

confluent-kafka version: 2.1.1 (latest)

Can you suggest which settings I should adjust to avoid the problem? Could it be a back-pressure problem? (I'm dealing with a large amount of data per second.)

I tried adjusting the parameters above, but it made no difference.

No, the authentication is SCRAM-SHA-512 (username and password). Actually many requests succeed while others get a timeout. — JustAnOtherNickname, Jun 05 '23 at 13:17
That's for SASL, right? (and where are you providing those?) But you're also using SSL/TLS, which requires certificates — OneCricketeer, Jun 05 '23 at 13:21
In any case, if you're batching several MB of data without calling `producer.flush()`, then it's possible for the batches to time out, yes — OneCricketeer, Jun 05 '23 at 13:23
For `security.protocol` I'm using SASL_SSL. And yes, I'm calling `producer.flush()` manually. — JustAnOtherNickname, Jun 05 '23 at 13:26
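
To make the back-pressure and `flush()` points from the comments concrete, here is a minimal sketch of a produce loop that serves delivery reports with `poll()` and backs off when the client's local queue is full. The function name `produce_with_backpressure` and the `max_wait_s` parameter are hypothetical, chosen for illustration; `producer` is assumed to behave like `confluent_kafka.Producer` (its `produce`, `poll`, and `flush` methods, with `produce` raising `BufferError` when the local queue is full). This is not a confirmed fix for the timeouts above, only a way to see whether the local queue is saturating.

```python
import time


def produce_with_backpressure(producer, topic, payloads, max_wait_s=30.0):
    """Produce payloads, backing off whenever the local queue is full.

    `producer` is assumed to expose produce/poll/flush like
    confluent_kafka.Producer. Returns (delivered, failed, remaining).
    """
    counts = {"delivered": 0, "failed": 0}

    def on_delivery(err, msg):
        # Invoked from poll()/flush(), once per produced message.
        if err is not None:
            counts["failed"] += 1
        else:
            counts["delivered"] += 1

    for payload in payloads:
        deadline = time.monotonic() + max_wait_s
        while True:
            try:
                producer.produce(topic, value=payload, on_delivery=on_delivery)
                break
            except BufferError:
                # Local queue is full: give up after max_wait_s, otherwise
                # block briefly in poll() so delivery reports free up space.
                if time.monotonic() >= deadline:
                    raise
                producer.poll(0.5)
        # Serve any pending delivery callbacks without blocking.
        producer.poll(0)

    # flush() returns the number of messages still waiting in the queue.
    remaining = producer.flush(max_wait_s)
    return counts["delivered"], counts["failed"], remaining
```

With the real client this would be called with something like `Producer(producer_config)` plus the connection and SASL settings. A non-zero `failed` or `remaining` count, or frequent `BufferError` back-offs, would suggest the producer is generating data faster than the brokers acknowledge it.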

0 Answers