I have a long-running PySpark Structured Streaming job that reads from a Kafka topic, does some processing, and writes the result to another Kafka topic. Our Kafka brokers run on a separate cluster.
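For context, the job is a fairly standard Kafka-to-Kafka pipeline; the sketch below is a simplified stand-in (broker addresses, topic names, checkpoint path and the processing step are placeholders, not the real values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-streaming-job").getOrCreate()

# Read from the source topic on the remote Kafka cluster (placeholder values).
source = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "my-kafka-xxx:9093")
    .option("subscribe", "input-topic")
    .load()
)

# "Some processing" stands in for the real transformation logic.
processed = source.selectExpr("CAST(key AS STRING) AS key",
                              "CAST(value AS STRING) AS value")

# Write the result to the output topic on the same remote cluster.
query = (
    processed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "my-kafka-xxx:9093")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start()
)

query.awaitTermination()
```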
It runs fine, but every few hours it freezes, even though the YARN application still shows status "running" in the web UI. After inspecting the logs, the freeze seems to be caused by a transient connectivity problem with the Kafka source. Indeed, all tasks of the problematic micro-batch have completed correctly, except one, which keeps logging:
21/08/11 19:19:59 INFO AbstractCoordinator: Discovered coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Marking the coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) dead for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Discovered coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Marking the coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) dead for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Discovered coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Marking the coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) dead for group spark-kafka-source-yyy
Neither Spark nor YARN detects the failure: the task runs forever (up to several days) and keeps printing 10-20 such messages per second. Restarting the process fixes the problem.
Is there a way to force the Spark task (on YARN) to fail in such a situation? It would then be restarted automatically and the problem should be solved. Of course, any other way to restore the Kafka connection would be fine too...
I know it is possible to have YARN kill containers that exceed a maximum memory consumption, but I haven't seen anything similar for execution time.
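If there is no built-in mechanism, the workaround I have in mind is a driver-side watchdog that stops the query when no micro-batch has completed for too long, so that the surrounding application (or the next YARN attempt) can restart it. This is only a sketch: the timeout value is arbitrary, and I'm not sure query.stop() can even interrupt a task that is stuck in the Kafka consumer loop.

```python
# Hypothetical driver-side watchdog (not the actual job code):
# stop the streaming query if no micro-batch has completed for
# STALL_TIMEOUT seconds. `query` is the StreamingQuery returned
# by .start() in the sketch above.
from datetime import datetime

STALL_TIMEOUT = 600  # placeholder: 10 minutes without progress

while True:
    # With a timeout (seconds), awaitTermination() returns True once the
    # query has terminated and False if it is still running.
    if query.awaitTermination(timeout=30):
        break

    progress = query.lastProgress  # most recent micro-batch progress, or None
    if progress is None:
        continue

    # Progress timestamps look like "2021-08-11T19:19:59.123Z".
    last_batch = datetime.strptime(progress["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    if (datetime.utcnow() - last_batch).total_seconds() > STALL_TIMEOUT:
        query.stop()  # force the stalled query to terminate
        raise RuntimeError("Streaming query made no progress, forcing a restart")
```

But this feels like a hack, so a way to make Spark or YARN detect the stall itself would be preferable.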