Exception Stacktrace:
org.springframework.kafka.core.KafkaProducerException: Failed to send; nested exception is org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for ****-656 due to 30037 ms has passed since batch creation plus linger time
      at org.springframework.kafka.core.KafkaTemplate$1.onCompletion(KafkaTemplate.java:255) ~[spring-kafka-1.1.6.RELEASE.jar!/:?]
      at org.apache.kafka.clients.producer.internals.RecordBatch.done(RecordBatch.java:109) ~[kafka-clients-0.10.1.1.jar!/:?]
      at org.apache.kafka.clients.producer.internals.RecordBatch.maybeExpire(RecordBatch.java:160) ~[kafka-clients-0.10.1.1.jar!/:?]
      at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortExpiredBatches(RecordAccumulator.java:245) ~[kafka-clients-0.10.1.1.jar!/:?]
      at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:212) ~[kafka-clients-0.10.1.1.jar!/:?]
      at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:135) ~[kafka-clients-0.10.1.1.jar!/:?]
      at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]

I received the above exception in the PROD environment, for some of the Kafka messages, on the very first day of deployment, and had to back the changes out of PROD. In the Stage environment I never saw this exception while testing. I was able to reproduce it once, but only once out of perhaps 10 runs. Now I have no direction on how to find the RCA for this issue.

I am posting the Kafka sender configuration below:

retries=3
retryBackoffMS=500
lingerMS=30
autoFlush=true
acksConfig=all
kafkaServerConfig=***<Can't post here>
reconnectBackoffMS=200
compressionType=snappy
batchSize=1000000
maxBlockMS=500000
        <dependency>
            <groupId>org.springframework.kafka</groupId>
            <artifactId>spring-kafka</artifactId>
            <version>1.1.8.RELEASE</version>
        </dependency>
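For reference, the camelCase names above look like a wrapper's own property names; assuming they map one-to-one onto the standard Kafka producer keys, the equivalent raw configuration would be roughly this (a sketch; the mapping is an assumption, since the wrapper code isn't shown):

```java
import java.util.Properties;

public class ProducerProps {
    // Sketch: the standard kafka-clients property keys that the camelCase
    // settings above presumably correspond to (an assumption, since the
    // wrapper itself isn't posted).
    static Properties senderProps() {
        Properties props = new Properties();
        props.put("retries", "3");
        props.put("retry.backoff.ms", "500");
        props.put("linger.ms", "30");
        props.put("acks", "all");
        props.put("reconnect.backoff.ms", "200");
        props.put("compression.type", "snappy");
        props.put("batch.size", "1000000");  // ~1 MB per batch
        props.put("max.block.ms", "500000");
        // bootstrap.servers omitted, as in the question
        return props;
    }
}
```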


The exception basically says that records sitting in the producer's buffer reached the timeout before they could be sent.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer?source=post_page-----fa3910d9aa54----------------------#KIP-91ProvideIntuitiveUserTimeoutsinTheProducer-TestPlan

The reason you don't see this exception in stage is that the prod environment is busier.

Can you update your spring-kafka version? Your Kafka client is far behind the newest release: https://mvnrepository.com/artifact/org.springframework.kafka/spring-kafka/1.1.8.RELEASE uses kafka-clients 0.10.x, while the current release line is already 2.3.x.

If you can use the newest version, you can set delivery.timeout.ms higher.
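On a recent kafka-clients (2.1+), batch expiry is governed by this single setting, which bounds the total time a record may spend between send() and acknowledgement. As a sketch, assuming a plain Properties-based configuration (the specific values are illustrative, not from the question):

```java
import java.util.Properties;

public class NewClientTimeouts {
    // Sketch for kafka-clients >= 2.1: delivery.timeout.ms bounds the total
    // time a record may spend batching, retrying and in flight. The client
    // requires it to be >= linger.ms + request.timeout.ms.
    static Properties withDeliveryTimeout() {
        Properties props = new Properties();
        props.put("linger.ms", "30");
        props.put("request.timeout.ms", "30000");
        props.put("delivery.timeout.ms", "120000"); // the default; raise it if sends still expire
        return props;
    }
}
```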

If you cannot upgrade to a newer version, you will have to play with linger.ms and request.timeout.ms (try increasing them).
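On the 0.10.x client there is no delivery.timeout.ms; batches waiting in the accumulator are expired roughly request.timeout.ms after batch creation plus linger time, which matches the "30037 ms has passed since batch creation plus linger time" in the stack trace (request.timeout.ms defaults to 30000). A hedged sketch of the knobs to raise (values are illustrative):

```java
import java.util.Properties;

public class OldClientTimeouts {
    // Sketch for kafka-clients 0.10.x: batches in the accumulator expire after
    // roughly request.timeout.ms (+ linger.ms), which is what the
    // "30037 ms has passed since batch creation plus linger time" error reflects.
    static Properties withLongerExpiry() {
        Properties props = new Properties();
        props.put("request.timeout.ms", "60000"); // up from the 30000 default
        props.put("linger.ms", "100");            // allow batching a little longer under load
        return props;
    }
}
```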

Besides that, the default for retries is Integer.MAX_VALUE, so your retries=3 is not very practical. If you don't want it retrying all the time, something like 30 is more practical. See https://docs.confluent.io/current/installation/configuration/producer-configs.html or https://kafka.apache.org/documentation/#producerconfigs

Note that both links point to the docs for the current version.

  • I have autoFlush=true, so linger.ms and batch size will not matter, right? As soon as there are messages to send, the Kafka sender will flush them to the network. "Set autoFlush to true if you have configured the producer's linger.ms to a non-default value and wish send operations on this template to occur immediately, regardless of that setting, or if you wish to block until the broker has acknowledged receipt according to the producer's acks property." Source: https://docs.spring.io/spring-kafka/api/org/springframework/kafka/core/KafkaTemplate.html – amitwdh Oct 18 '19 at 11:41
  • How will I make sure that the same issue will not come up in production? – amitwdh Oct 18 '19 at 11:46
  • You are right: autoFlush simply means linger.ms will not function. But to resolve the issue you face, you would have to stop sending records immediately, so that the application is much more performant. To see this issue in your stage environment, you would have to mirror your live Kafka (try mirroring only the topics your app requires) using MirrorMaker, or test your app in stage but connected to your live Kafka brokers (if you don't produce to Kafka, or produce only to a test topic). – Holm Oct 18 '19 at 14:25
  • I already tried that. In the stage environment I pointed at the prod Kafka brokers and tried sending to a test topic, and it still did not reproduce the issue. The thing is, we don't know what exactly the issue is, and we can't take the risk of deploying the changes to PROD again; there is no guarantee the issue will not recur, and the leadership team will not allow a PROD deployment without an RCA. The Kafka sender processes ~8 lakh (~800,000) records in ~10 minutes; is that too much? – amitwdh Oct 19 '19 at 15:02