
I have some large datasets where certain records cause a UDF to crash. Once such a record is processed, the task fails, which leads to the whole job failing. The failures are native (we use a native Fortran library via JNA), so I cannot catch them in the UDF.

What I'd like is a fault-tolerance mechanism that allows me to skip/ignore/blacklist bad partitions/tasks so that my Spark app does not fail.

Is there a way to do this?

The only thing I could come up with is to process small chunks of data in a foreach loop:

import org.apache.spark.SparkException
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.UserDefinedFunction

val dataFilters: Seq[Column] = ???
val myUDF: UserDefinedFunction = ???

dataFilters.foreach { filter =>
  try {
    ss.table("sourcetable")
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    case e: SparkException =>
      // this chunk kept crashing its tasks, so the whole write failed: skip it
      println(s"skipping chunk for filter $filter: ${e.getMessage}")
  }
}

This is not ideal because Spark is relatively slow at processing small amounts of data; for example, the input table is read once per filter.
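
Caching the source table up front would at least avoid re-reading it from storage on every pass. A sketch, using the same names as above (caveat: cached blocks that lived on an executor killed by the native crash are lost and silently recomputed):

val source = ss.table("sourcetable").cache()    // read from storage once

dataFilters.foreach { filter =>
  try {
    source                                      // reuse the cached table for each filter
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    case e: Exception =>
      println(s"skipping chunk for filter $filter: ${e.getMessage}")
  }
}

source.unpersist()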

Raphael Roth
  • And why can't the UDF catch the exception and return `None`? – mazaneicha Aug 27 '20 at 15:52
  • @mazaneicha because native signals (e.g. SIGSEGV) and Fortran runtime errors cannot be caught; they happen outside the JVM – Raphael Roth Aug 31 '20 at 06:41
  • Ouch! I guess using JNI from UDFs and calling Fortran libraries that tend to blow up isn't exactly a mainstream Spark use case, so I doubt you'll find an OOTB solution. Probably a custom wrapper with a SIGSEGV handler, or a standalone service, is in order? – mazaneicha Aug 31 '20 at 12:33
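
Sketching the "standalone service" idea from the last comment: move the fragile native call out of the executor JVM into a child process, so a segfault kills only that process and surfaces as a catchable non-zero exit code. Everything here is an assumption for illustration: native_worker is a hypothetical wrapper binary around the Fortran routine that takes the input value as its argument and prints the result, the id/inputcol columns stand in for the real schema, and one process per record is only for clarity (a long-running worker per partition would amortize the startup cost).

import scala.sys.process._
import scala.util.Try

import ss.implicits._               // ss: the SparkSession from the question

val withResults = ss.table("sourcetable")
  .select($"id", $"inputcol")
  .as[(Long, String)]               // assumed column types, adjust to the real schema
  .map { case (id, input) =>
    // .!! runs the child process and throws on a non-zero exit code
    // (including a crash of the native code), which we turn into None here
    val result: Option[String] = Try(Seq("native_worker", input).!!.trim).toOption
    (id, result)
  }
  .toDF("id", "udf_result")

withResults.write.insertInto("targettable")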
