
I am facing a very weird issue with Kafka Streams: under heavy load, when a rebalance happens, my Kafka Streams application keeps getting stuck, with the following error showing up in the logs repeatedly:

    org.apache.kafka.streams.errors.LockException: stream-thread [metricsvc-metric-space-aggregation-9f4389a2-85de-43dc-a45c-3d4cc66150c4-StreamThread-1] task [0_13] Failed to lock the state directory for task 0_13
    at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:91) ~[kafka-streams-2.8.1.jar:?]
    at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:216) ~[kafka-streams-2.8.1.jar:?]
    at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:433) ~[kafka-streams-2.8.1.jar:?]
    at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:849) ~[kafka-streams-2.8.1.jar:?]
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:731) ~[kafka-streams-2.8.1.jar:?]
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:583) ~[kafka-streams-2.8.1.jar:?]
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:556) ~[kafka-streams-2.8.1.jar:?]

I am debugging some old code written by a developer who is no longer with our company, and this part is running into issues. Unfortunately, the code is not well documented. In this part of the code he tried to override some of the Kafka Streams WindowedStore and ReadOnlyWindowedStore classes for optimization. I understand it is quite difficult to find the root cause without seeing the complete code, but is there something obvious that I should be looking at to solve this?

I am currently running 4 Kubernetes pods for this service, and each of them has its own independent state directory.

I expect not to get the error above, and even if it does happen, Kafka Streams should recover from it gracefully, but that doesn't happen in our case.
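For reference, the relevant part of our Streams configuration looks roughly like the sketch below. Only the application id is taken from the thread name in the stack trace above; the class name, bootstrap servers, state directory path, and thread count are illustrative placeholders, not our exact production values:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class MetricAggregationConfig {
        // Illustrative sketch only: application id matches the stream-thread name
        // in the stack trace; everything else is a placeholder.
        static Properties streamsProperties() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metricsvc-metric-space-aggregation");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            // Each pod mounts its own volume, so state directories are not shared between pods.
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
            // More than one StreamThread per pod.
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
            return props;
        }
    }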

1 Answer


Are there multiple StreamThread instances per pod? If so, you could be affected by https://issues.apache.org/jira/browse/KAFKA-12679
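If reducing `num.stream.threads` to `1` is an option for you until the fix is released, that should sidestep the problem, since a task then never has to be handed over between threads of the same instance during a rebalance. A minimal sketch, assuming the Streams configuration is built as a `Properties` object (the class and method names here are just for illustration):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class SingleThreadWorkaround {
        // Workaround sketch: a single StreamThread per instance, so a task's state
        // directory is never contended between threads of the same instance.
        static void applyWorkaround(Properties props) {
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1); // the default is also 1
        }
    }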

Lucas Brutschy
  • Yes there are. And comparing our logs we seem to be running into the exact same issue. We also have CPU spikes happening at the exact same time so the starvation of the old stream thread also makes sense. The bug is marked as fix version 3.4.0. Any idea when that is coming out? – birinder tiwana Nov 01 '22 at 15:42
  • Looks like somebody contributed a patch, but so far nobody has submitted the change as a PR and gotten it accepted. So, this being an open source project, I suppose your best bet for getting this fixed is to submit a PR yourself and get it accepted. – Lucas Brutschy Nov 01 '22 at 16:25
  • Yes, trying to do that. Thanks! – birinder tiwana Nov 02 '22 at 18:26
  • Interestingly, this issue occurs in only one of our 20 services and could be worked around by reducing `num.stream.threads` to `1` – Andras Hatvani Jun 13 '23 at 08:59