How can I set a maximum allowed execution time per task on Spark-YARN?

Question

I have a long-running PySpark Structured Streaming job, which reads a Kafka topic, does some processing and writes the result back to another Kafka topic. Our Kafka server runs on another cluster.

It's running fine but every few hours it freezes, even though in the web UI the YARN application still has status "running". After inspecting the logs, it seems due to some transient connectivity problem with the Kafka source. Indeed, all tasks of the problematic micro-batch have completed correctly, except one which shows:

21/08/11 19:19:59 INFO AbstractCoordinator: Discovered coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Marking the coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) dead for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Discovered coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Marking the coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) dead for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Discovered coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) for group spark-kafka-source-yyy
21/08/11 19:19:59 INFO AbstractCoordinator: Marking the coordinator my-kafka-xxx:9093 (id: 2147483646 rack: null) dead for group spark-kafka-source-yyy

The failure is not detected by Spark or YARN and the task runs forever (up to several days) and keeps on printing 10-20 such error messages per second. Restarting the process fixes the problem.

Is there a possibility to force the failure of the Spark task (on YARN) in such a situation? Then it would be restarted automatically and problem should be solved. Of course, any other way to restore the Kafka connection would be fine too...

I know it is possible to kill YARN containers based on a max acceptable memory consumption, but haven't seen anything similar for execution time.

[This question](https://stackoverflow.com/questions/50593008/kafka-resiliency-group-coordinator) is related to the root cause, but it's not solved and in my case I never get the 'auto offset commit failed' message — Pierre Gramme, Aug 12 '21 at 18:01
[Another post](https://community.cloudera.com/t5/Support-Questions/Error-in-kafka-consumer/td-p/67498) related to my root error, but that one also doesn't mention the infinite dead-discovered swapping — Pierre Gramme, Aug 12 '21 at 18:11
No single answer or comment despite the bounty? Life is tough... — Pierre Gramme, Aug 23 '21 at 10:13

Pierre Gramme · Accepted Answer · 2021-08-24T08:26:28.083

I haven't found a solution to do it with YARN, but a workaround using a monitoring loop in the Pyspark driver. The loop will check status regularly and fail the streaming app if status hasn't been updated for 10 minutes

MAX_DURATION = 10*60 # in seconds

df:DataFrame = define_my_data_stream(params)
writer:DataStreamWriter = write_to_my_kafka(df)

qy = writer.start()

prevBatch = -1
while not spark.streams.awaitAnyTermination(defaultMaxDuration):
    lastBatch = qy.lastProgress['batchId']
    if lastBatch == prevBatch:
        qy.stop()
        print("Query stopped")
        raise RuntimeError("Query '"+(qy.name or "")+"' ("+qy.id+") has stalled")
    else:
        prevBatch = lastBatch

Raising the exception will fail the Spark app. This failure can then be managed by YARN and the app restarted by using the following options to spark-submit:

--conf spark.yarn.maxAppAttempts=2 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=4 \
--conf spark.yarn.executor.failuresValidityInterval=1h \

It does work: freezing is detected and application is restarted from checkpoint. But it can be restarted only once, as if I hadn't specified the failuresValidityInterval parameters. That's another problem, known issue of Spark...

How can I set a maximum allowed execution time per task on Spark-YARN?

1 Answers1