
I have an application based on Spark Structured Streaming 3 with Kafka, which is processing some user logs. After some time the driver starts killing the executors and I don't understand why. The executor logs don't contain any errors. I'm leaving the logs from the executors and the driver below.

On the executor 1:

20/08/31 10:01:31 INFO executor.Executor: Finished task 5.0 in stage 791.0 (TID 46411). 1759 bytes result sent to driver
20/08/31 10:01:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown

On the executor 2:

20/08/31 10:14:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
20/08/31 10:14:34 INFO memory.MemoryStore: MemoryStore cleared
20/08/31 10:14:34 INFO storage.BlockManager: BlockManager stopped
20/08/31 10:14:34 INFO util.ShutdownHookManager: Shutdown hook called

On the driver:

20/08/31 10:01:33 ERROR cluster.YarnScheduler: Lost executor 3 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 130392 ms

20/08/31 10:53:33 ERROR cluster.YarnScheduler: Lost executor 2 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 125773 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129308 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129314 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129311 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129305 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.

Has anyone had the same problem and solved it?

M. Alexandru
  • please check the memory and CPU status on the executor/master nodes. This can also happen when an executor runs out of memory. Increase the executor memory overhead and heap memory, and also increase the heartbeat interval from the executors to the master, so that the YarnScheduler will not mark them as lost executors so easily; finally, increase the number of executors (a configuration sketch follows these comments). – kavetiraviteja Aug 31 '20 at 10:26
  • I don't understand why I need to increase "spark.executor.heartbeatInterval", because increasing it means the executors heartbeat the driver less often. Also, YARN doesn't show any important events. Is there another tool to check memory/CPU? Thanks – M. Alexandru Aug 31 '20 at 12:46
  • you can enable ***Ganglia*** on the cluster and easily see CPU and memory stats... I suggested increasing ***spark.executor.heartbeatInterval*** only because executors sometimes become too busy to acknowledge the driver: under heavy CPU or memory pressure, the daemon threads that acknowledge the driver do not get the CPU and memory they need to send heartbeats. If the driver does not receive a heartbeat within the defined time, it marks the executor as lost and initiates the kill procedure. – kavetiraviteja Aug 31 '20 at 12:58
  • Thanks for your explanation. I'm going to increase it and check on Ganglia. – M. Alexandru Aug 31 '20 at 13:07
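
For reference, a minimal sketch of the settings discussed in these comments, set when building the session. The keys are real Spark configuration options, but the values are illustrative placeholders, not tuned recommendations; on YARN they are more commonly passed to spark-submit via --conf:

import org.apache.spark.sql.SparkSession

// Illustrative values only; tune them for your own cluster.
val spark = SparkSession.builder()
  .appName("user-logs-streaming") // hypothetical app name
  // How often executors send heartbeats to the driver (default 10s).
  // Must stay well below spark.network.timeout.
  .config("spark.executor.heartbeatInterval", "30s")
  // The "heartbeat timed out after ~130000 ms" in the driver log likely
  // corresponds to this timeout (default 120s).
  .config("spark.network.timeout", "300s")
  // Extra off-heap headroom per executor (the Spark 3 key; it was
  // spark.yarn.executor.memoryOverhead in Spark 2).
  .config("spark.executor.memoryOverhead", "2g")
  .getOrCreate()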

1 Answer


Looking at the information at hand:

  • no errors
  • Driver commanded a shutdown
  • Yarn logs showing "state FINISHED"

this seems to be expected behavior.

This typically happens if you forget to await the termination of the Spark streaming query. If you do not conclude your code with

query.awaitTermination()

your streaming application will simply shut down once all available data has been processed.
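
For completeness, here is a minimal sketch of such a streaming job with the call in place. The topic name, broker address, and checkpoint path are placeholders, and it assumes the spark-sql-kafka connector is on the classpath:

import org.apache.spark.sql.SparkSession

object UserLogsStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("user-logs-streaming")
      .getOrCreate()

    // Read the user logs from Kafka (placeholder topic and broker).
    val logs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "user-logs")
      .load()

    val query = logs.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/user-logs")
      .start()

    // Without this call, main() returns as soon as the query has started,
    // the application exits, and the driver commands the executors to
    // shut down, which matches the logs above.
    query.awaitTermination()
  }
}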

Michael Heil