How can I increase failure tolerance on YARN? On a busy cluster my job fails because of too many failures. Most of the failures were lost executors caused by preemption.

Georg Heiler

2 Answers

If you have preemption enabled, you should be using the external shuffle service to avoid these issues. There's not much that can be done otherwise.

This JIRA ticket discusses the issue: https://issues.apache.org/jira/browse/SPARK-14209
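
As a rough sketch of what enabling the external shuffle service could look like on the submit side: `spark.shuffle.service.enabled` is a real Spark setting, but the helper name `shuffle_service_args` and the script name `my_job.py` are made up for illustration. The shuffle service also has to be running on the YARN NodeManagers, which is something the cluster admin sets up and which this snippet does not cover.

```python
# Sketch only: assemble a spark-submit command line that turns on the
# external shuffle service. The helper and script names are hypothetical.

def shuffle_service_args(app_script):
    """Return spark-submit arguments with the external shuffle service enabled."""
    conf = {
        # Shuffle files are served by the NodeManager-side service,
        # so they can outlive an executor that gets preempted.
        "spark.shuffle.service.enabled": "true",
    }
    args = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_script)
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is submitted by accident.
    print(" ".join(shuffle_service_args("my_job.py")))
```

Printing the command rather than executing it keeps the sketch safe to try on a machine without Spark installed.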

Ashkrit Sharma

Disable YARN preemption? Or run smaller jobs to avoid complete recomputation?

Yves
Well, that's definitely an option, but neither one is desirable, nor do I have permissions to manage the YARN queue settings myself ;). Are there other possibilities? `spark.yarn.max.executor.failures` sounds interesting. – Georg Heiler Feb 15 '19 at 13:06
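
For anyone wanting to try the `spark.yarn.max.executor.failures` route from the comment above, here is a minimal sketch of raising the failure thresholds at submit time. The three setting names exist in Spark-on-YARN releases of that era, but the helper names and all the numbers are illustrative assumptions, not recommendations, and whether this actually helps with preemption-induced losses depends on the Spark version.

```python
# Sketch only: build --conf flags that loosen Spark's failure limits on YARN.
# Helper names and the example values are hypothetical.

def failure_tolerance_conf(max_executor_failures, validity_interval="1h",
                           task_max_failures=8):
    """Return a dict of Spark settings that raise the failure thresholds."""
    return {
        # Executor failures allowed before the application is failed.
        "spark.yarn.max.executor.failures": str(max_executor_failures),
        # Executor failures older than this interval are no longer counted.
        "spark.yarn.executor.failuresValidityInterval": validity_interval,
        # Failures of a single task allowed before the job is aborted.
        "spark.task.maxFailures": str(task_max_failures),
    }


def to_submit_flags(conf):
    """Flatten a settings dict into a list of spark-submit --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(" ".join(to_submit_flags(failure_tolerance_conf(100))))
```

These flags can be set by the job owner without touching the YARN queue configuration, which is why they may be an option when queue settings are off limits.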