How can I increase failure tolerance for Spark jobs on YARN?
In a busy cluster my job fails due to too many failures. Most of the failures are lost executors caused by preemption. How can I increase failure tolerance so the job does not fail because of too many preemptions?
0

Georg Heiler
2 Answers
1
If you have preemption enabled, you really should be using the external shuffle service to avoid these issues; there is not much else that can be done. SPARK-14209 (https://issues.apache.org/jira/browse/SPARK-14209) discusses this problem.
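
A minimal sketch of enabling it on the Spark side, assuming the YARN auxiliary shuffle service (`spark_shuffle`) is already deployed on the NodeManagers; the app name is made up:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: with the external shuffle service enabled, shuffle files are
// served by the NodeManager and outlive a preempted executor, so its shuffle
// output does not have to be recomputed from scratch.
val spark = SparkSession.builder()
  .appName("preemption-tolerant-job") // hypothetical app name
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.enabled", "true") // commonly paired with the shuffle service
  .getOrCreate()
```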

Ashkrit Sharma
- Is this specific to Spark? Why does Hive seem to be unaffected? It still works fine without the external shuffle service? – Georg Heiler Feb 15 '19 at 15:54
- Also, is there something like `spark.task.maxFailures` for stage failures? – Georg Heiler Feb 16 '19 at 06:09
- `spark.stage.maxConsecutiveAttempts` looks like a viable workaround. – Georg Heiler Feb 16 '19 at 06:18
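
For reference, a sketch of setting the two properties from the comments above on a session; the values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: raise per-task retries and the number of
// consecutive stage attempts tolerated before the stage is aborted.
val spark = SparkSession.builder()
  .appName("failure-tolerant-job") // hypothetical app name
  .config("spark.task.maxFailures", "8")              // default 4
  .config("spark.stage.maxConsecutiveAttempts", "10") // default 4
  .getOrCreate()
```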
0
Disable YARN preemption? Or run smaller jobs to avoid complete recomputation?

Yves
- Well, that's definitely an option, but neither is desirable, nor do I have permissions to self-manage the YARN queue settings ;). Are there other possibilities? `spark.yarn.max.executor.failures` sounds interesting. – Georg Heiler Feb 15 '19 at 13:06
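
A sketch of the YARN-side knob mentioned in that comment, together with `spark.yarn.executor.failuresValidityInterval` (not mentioned in the thread, but the documented companion setting that turns the failure count into a sliding window); the values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: tolerate more executor failures before the
// application is failed, and let old failures age out of the count.
val spark = SparkSession.builder()
  .appName("yarn-failure-tolerant-job") // hypothetical app name
  .config("spark.yarn.max.executor.failures", "64")             // default: 2 * numExecutors, min 3
  .config("spark.yarn.executor.failuresValidityInterval", "1h") // failures older than this are ignored
  .getOrCreate()
```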