Suppose you have Spark with the Standalone cluster manager. You've opened a Spark session with some configs and want to launch SomeSparkJob 40 times in parallel with different arguments.
Questions
- How to set the number of retries on job failure?
- How to restart jobs programmatically on failure? This could be useful if jobs fail due to a lack of resources; then I could launch, one by one, all the jobs that require extra resources (see the retry sketch after this list).
- How to restart the Spark application on job failure? This could be useful if a job lacks resources even when it's launched on its own. Then, to change the cores, CPU, etc. configs, I'd need to relaunch the application in the Standalone cluster manager.
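For 2), the bluntest approach I can think of: a failed job surfaces as an org.apache.spark.SparkException on the driver thread that triggered the action, so a plain try/catch retry loop already works without any listener. A minimal sketch, where runSomeSparkJob and args are hypothetical placeholders for one SomeSparkJob invocation:

```java
// Hypothetical wrapper: retries one SomeSparkJob run up to maxRetries times.
// A failed job throws org.apache.spark.SparkException from the action itself.
void launchWithRetries(SparkSession spark, String[] args, int maxRetries) throws Exception {
    for (int attempt = 1; ; attempt++) {
        try {
            runSomeSparkJob(spark, args); // runs the actions (count(), collect(), ...)
            return;                       // success, stop retrying
        } catch (Exception e) {
            if (attempt >= maxRetries) throw e; // give up, let the caller react
            // optionally sleep/back off here until cluster resources free up
        }
    }
}
```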
My workarounds
1) I'm pretty sure the 1st point is possible, since it's possible in Spark local mode. I just don't know how to do that in standalone mode.
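If the retries in question are task retries, the knob seems to be spark.task.maxFailures, which works in any deploy mode; the local[*, 4] master URL syntax is just the local-mode way of setting it. Note it retries individual tasks, not whole jobs. A sketch, with a hypothetical standalone master URL:

```java
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("spark://master-host:7077")    // hypothetical standalone master URL
        .appName("SomeSparkJob")
        // retry each task up to 8 times before failing the whole job;
        // in local mode the same knob is the master URL, e.g. local[*, 8]
        .config("spark.task.maxFailures", "8")
        .getOrCreate();
```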
2-3) It's possible to hang a listener on the Spark context, like spark.sparkContext().addSparkListener(new SparkListener() { ... }). But it seems SparkListener lacks dedicated failure callbacks.
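The closest I can see is onJobEnd: on failure its job result is a JobFailed carrying the exception. A sketch (note that JobFailed is marked private[spark] in Scala but is visible from Java bytecode, so this is somewhat of a hack):

```java
import org.apache.spark.scheduler.JobFailed;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobEnd;

spark.sparkContext().addSparkListener(new SparkListener() {
    @Override
    public void onJobEnd(SparkListenerJobEnd jobEnd) {
        // On failure, jobResult() is a JobFailed carrying the root exception.
        if (jobEnd.jobResult() instanceof JobFailed) {
            Exception cause = ((JobFailed) jobEnd.jobResult()).exception();
            System.err.println("Job " + jobEnd.jobId() + " failed: " + cause);
            // record jobEnd.jobId() somewhere so a driver-side loop can resubmit
        }
    }
});
```

Since the listener bus is asynchronous, the callback can only record the failure for some driver-side loop to act on; it can't retry the job inline.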
Also there is a bunch of methods with very poor documentation. I've never used them, but perhaps they could help solve my problem:
spark.sparkContext().dagScheduler().runJob();
spark.sparkContext().runJob();
spark.sparkContext().submitJob();
spark.sparkContext().taskScheduler().submitTasks();
spark.sparkContext().dagScheduler().handleJobCancellation();
spark.sparkContext().statusTracker();
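Of these, statusTracker() looks like the only stable public entry point; dagScheduler() and taskScheduler() are private[spark] internals and not meant to be called directly. A polling sketch via the Java wrapper:

```java
import org.apache.spark.SparkJobInfo;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaSparkStatusTracker;

// Poll the state of currently running jobs (RUNNING / SUCCEEDED / FAILED).
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaSparkStatusTracker tracker = jsc.statusTracker();
for (int jobId : tracker.getActiveJobIds()) {
    SparkJobInfo info = tracker.getJobInfo(jobId); // null if the id is unknown
    if (info != null) {
        System.out.println("Job " + jobId + ": " + info.status());
    }
}
```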