
I have some large datasets where certain records cause a UDF to crash. Once such a record is processed, the task fails, which leads to the whole job failing. The failures are native (we use a native Fortran library via JNA), so I cannot catch them in the UDF.

What I'd like is a fault-tolerance mechanism that allows me to skip/ignore/blacklist bad partitions/tasks so that my Spark app does not fail.

Is there a way to do this?

The only thing I could come up with is to process small chunks of data in a foreach loop:

import org.apache.spark.SparkException
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.UserDefinedFunction

val dataFilters: Seq[Column] = ???
val myUDF: UserDefinedFunction = ???

dataFilters.foreach { filter =>
  try {
    ss.table("sourcetable")
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    case e: SparkException =>
      // this chunk kept crashing its tasks, so the whole write failed: skip it
      println(s"skipping chunk for filter $filter: ${e.getMessage}")
  }
}

This is not ideal because Spark is relatively slow at processing small amounts of data; for example, the input table is read once per filter.
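
Caching the source table up front would at least avoid re-reading it from storage on every pass. A sketch, using the same names as above (caveat: cached blocks that lived on an executor killed by the native crash are lost and silently recomputed):

val source = ss.table("sourcetable").cache()    // read from storage once

dataFilters.foreach { filter =>
  try {
    source                                      // reuse the cached table for each filter
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    case e: Exception =>
      println(s"skipping chunk for filter $filter: ${e.getMessage}")
  }
}

source.unpersist()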

Raphael Roth
  • And why can't the UDF catch the exception and return `None`? – mazaneicha Aug 27 '20 at 15:52
  • @mazaneicha because native signals (e.g. SIGSEGV) and Fortran runtime errors cannot be caught; they happen outside the JVM – Raphael Roth Aug 31 '20 at 06:41
  • Ouch! I guess using JNI from UDFs and calling Fortran libraries that tend to blow up isn't exactly a mainstream Spark use case, so I doubt you'll find an OOTB solution. Probably a custom wrapper with a SIGSEGV handler, or a standalone service, is in order? – mazaneicha Aug 31 '20 at 12:33
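
Sketching the "standalone service" idea from the last comment: move the fragile native call out of the executor JVM into a child process, so a segfault kills only that process and surfaces as a catchable non-zero exit code. Everything here is an assumption for illustration: native_worker is a hypothetical wrapper binary around the Fortran routine that takes the input value as its argument and prints the result, the id/inputcol columns stand in for the real schema, and one process per record is only for clarity (a long-running worker per partition would amortize the startup cost).

import scala.sys.process._
import scala.util.Try

import ss.implicits._               // ss: the SparkSession from the question

val withResults = ss.table("sourcetable")
  .select($"id", $"inputcol")
  .as[(Long, String)]               // assumed column types, adjust to the real schema
  .map { case (id, input) =>
    // .!! runs the child process and throws on a non-zero exit code
    // (including a crash of the native code), which we turn into None here
    val result: Option[String] = Try(Seq("native_worker", input).!!.trim).toOption
    (id, result)
  }
  .toDF("id", "udf_result")

withResults.write.insertInto("targettable")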
