I am facing a very weird issue with Kafka Streams: under heavy load, when a rebalance happens, my Kafka Streams application keeps getting stuck, with the following error showing up in the logs repeatedly:
org.apache.kafka.streams.errors.LockException: stream-thread [metricsvc-metric-space-aggregation-9f4389a2-85de-43dc-a45c-3d4cc66150c4-StreamThread-1] task [0_13] Failed to lock the state directory for task 0_13
at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:91) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:216) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:433) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:849) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:731) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:583) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:556) ~[kafka-streams-2.8.1.jar:?]
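To check whether it is always the same task and thread that fails to acquire the lock, the repeated log lines can be tallied. This is only a rough sketch: the class name `LockExceptionTally` and the helper `tally` are made up for illustration, and the regular expression assumes the exact message format shown in the exception above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Kafka Streams.
class LockExceptionTally {
    // Assumes the message format from the log excerpt:
    // "LockException: stream-thread [<thread>] task [<id>] Failed to lock the state directory ..."
    private static final Pattern LOCK_FAILURE = Pattern.compile(
        "LockException: stream-thread \\[([^\\]]+)\\] task \\[(\\d+_\\d+)\\] "
            + "Failed to lock the state directory");

    /** Counts lock failures per "<task> @ <thread>" key across the given log lines. */
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = LOCK_FAILURE.matcher(line);
            if (m.find()) {
                // group(1) is the stream thread name, group(2) is the task id
                counts.merge(m.group(2) + " @ " + m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "org.apache.kafka.streams.errors.LockException: stream-thread [app-StreamThread-1] "
                + "task [0_13] Failed to lock the state directory for task 0_13",
            "at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:556)");
        System.out.println(tally(sample));
    }
}
```

If a single task id dominates the counts, that would suggest one particular task directory is never being released, rather than a general problem with the state directory.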
I am debugging some old code written by a developer who is no longer with our company, and this part is running into issues. Unfortunately, the code is not very well documented. In this part of the code he tried to override some of the Kafka Streams WindowedStore and ReadOnlyWindowedStore classes for optimization. I understand it is quite difficult to find the root cause without looking at the complete code, but is there something obvious that I should be looking at to solve this?
I am currently running four Kubernetes pods for this service, and each of them has its own independent state directory.
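To make "independent state directory" concrete, here is a minimal sketch of that kind of setup. It is not our actual configuration: the class `PerPodStateDir`, the base path `/var/lib/kafka-streams`, the application id value, and the use of the `HOSTNAME` environment variable as the pod name are all assumptions for illustration. Only the `application.id` and `state.dir` keys are real Kafka Streams settings.

```java
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not taken from the real service.
class PerPodStateDir {
    /** Builds Streams-style properties whose state.dir is unique per pod. */
    static Properties streamsProps(String applicationId, String baseDir, String podName) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        // Each pod gets its own sub-directory under the base path, so no two
        // pods resolve to the same state directory even on a shared volume.
        props.setProperty("state.dir", Paths.get(baseDir, podName).toString());
        return props;
    }

    public static void main(String[] args) {
        // Assumption: HOSTNAME carries the pod name; fall back for local runs.
        String pod = System.getenv().getOrDefault("HOSTNAME", "local-pod");
        Properties props = streamsProps(
            "metricsvc-metric-space-aggregation", // assumed application id
            "/var/lib/kafka-streams",             // assumed base path
            pod);
        System.out.println(props.getProperty("state.dir"));
    }
}
```

With a layout like this, a lock conflict between pods seems unlikely, which is why I suspect the contention is between threads or tasks inside a single instance.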
I expect not to get the error above, and even if it does happen, Kafka Streams should recover from it gracefully, but that doesn't happen in our case.