How can I increase failure tolerance for Spark jobs on YARN?
In a busy cluster my job fails due to too many failures. Most of the failures are lost executors caused by preemption. How can I increase failure tolerance so the job does not fail because of too many preemptions?
0

Georg Heiler
2 Answers
1
If you have preemption enabled, you really should be using the external shuffle service to avoid these issues; there is not much else that can be done. SPARK-14209 (https://issues.apache.org/jira/browse/SPARK-14209) discusses this problem.
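
A minimal sketch of enabling it on the Spark side, assuming the YARN auxiliary shuffle service (`spark_shuffle`) is already deployed on the NodeManagers; the app name is made up:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: with the external shuffle service enabled, shuffle files are
// served by the NodeManager and outlive a preempted executor, so its shuffle
// output does not have to be recomputed from scratch.
val spark = SparkSession.builder()
  .appName("preemption-tolerant-job") // hypothetical app name
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.enabled", "true") // commonly paired with the shuffle service
  .getOrCreate()
```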

Ashkrit Sharma
- Is this specific to Spark? Why does Hive seem to be unaffected? It still works fine without the external shuffle service? – Georg Heiler Feb 15 '19 at 15:54
- Also, is there something like `spark.task.maxFailures` for stage failures? – Georg Heiler Feb 16 '19 at 06:09
- `spark.stage.maxConsecutiveAttempts` looks like a viable workaround. – Georg Heiler Feb 16 '19 at 06:18
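
For reference, a sketch of setting the two properties from the comments above on a session; the values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: raise per-task retries and the number of
// consecutive stage attempts tolerated before the stage is aborted.
val spark = SparkSession.builder()
  .appName("failure-tolerant-job") // hypothetical app name
  .config("spark.task.maxFailures", "8")              // default 4
  .config("spark.stage.maxConsecutiveAttempts", "10") // default 4
  .getOrCreate()
```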
0
Disable YARN preemption? Or run smaller jobs to avoid complete recomputation?

Yves
- Well, that's definitely an option, but neither is desirable, nor do I have permissions to self-manage the YARN queue settings ;). Are there other possibilities? `spark.yarn.max.executor.failures` sounds interesting. – Georg Heiler Feb 15 '19 at 13:06
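
A sketch of the YARN-side knob mentioned in that comment, together with `spark.yarn.executor.failuresValidityInterval` (not mentioned in the thread, but the documented companion setting that turns the failure count into a sliding window); the values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: tolerate more executor failures before the
// application is failed, and let old failures age out of the count.
val spark = SparkSession.builder()
  .appName("yarn-failure-tolerant-job") // hypothetical app name
  .config("spark.yarn.max.executor.failures", "64")             // default: 2 * numExecutors, min 3
  .config("spark.yarn.executor.failuresValidityInterval", "1h") // failures older than this are ignored
  .getOrCreate()
```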