I have some large datasets where certain records cause a UDF to crash. Once such a record is processed, the task fails, which in turn makes the whole job fail. The crashes happen in native code (we call a native Fortran library via JNA), so I cannot catch them inside the UDF.
What I'd like is a fault-tolerance mechanism that lets me skip/ignore/blacklist the bad partitions/tasks so that my Spark app does not fail.
Is there a way to do this?
The only workaround I could come up with is to process small chunks of data in a foreach loop:
import org.apache.spark.SparkException

val dataFilters: Seq[Column] = ???
val myUDF: UserDefinedFunction = ???

dataFilters.foreach { filter =>
  try {
    ss.table("sourcetable")
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    // once task retries are exhausted, the failed chunk's job surfaces
    // as a SparkException on the driver, so the chunk can be skipped here
    case e: SparkException =>
      println(s"skipping chunk for filter $filter: ${e.getMessage}")
  }
}
This is not ideal, because Spark is relatively slow at processing small amounts of data; for example, the input table is read many times.
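To at least reduce the repeated reads, a variation of the loop above (just a sketch, assuming the source table fits into the cluster's memory/disk cache) would be to persist the source once before the loop, so each per-filter job scans the cached data instead of re-reading the table:

import org.apache.spark.SparkException
import org.apache.spark.storage.StorageLevel

// cache the source once so that each per-filter chunk
// does not re-read the underlying table
val source = ss.table("sourcetable").persist(StorageLevel.MEMORY_AND_DISK)

dataFilters.foreach { filter =>
  try {
    source
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    case e: SparkException => // skip the chunk whose job failed
  }
}

source.unpersist()

This does not solve the underlying problem, though: an executor that crashes also loses its cached blocks, so the affected partitions are recomputed from the table anyway, and the job is still split into many small Spark jobs.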