We are migrating our Spark jobs from 2.4.5 to 3.1.1 (and from Scala 2.11.11 to Scala 2.12.13).
We are facing the following exception:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.SparkContext.clean(SparkContext.scala:2465) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:912) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:911) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:749) ~[spark-sql_2.12-3.1.1.jar:3.1.1]
This is triggered when we use map, flatMap, or cogroup (with certain functions) on a Dataset. For instance, here is the signature of a function we use to map one Dataset to another:
def myFunction(input: (Seq[caseClass1], Option[caseClass2], Option[caseClass3])): (Seq[caseClass2], Seq[caseClass3], Seq[caseClass2], Seq[caseClass3])
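For reference, here is a minimal, self-contained sketch of the failing pattern. The case classes, data, and function body are illustrative placeholders, not our real code:

import org.apache.spark.sql.SparkSession

case class CaseClass1(id: Long)
case class CaseClass2(id: Long, value: String)
case class CaseClass3(id: Long, score: Double)

object Repro {
  // Same shape as our real function; the body here is a placeholder
  def myFunction(input: (Seq[CaseClass1], Option[CaseClass2], Option[CaseClass3]))
      : (Seq[CaseClass2], Seq[CaseClass3], Seq[CaseClass2], Seq[CaseClass3]) = {
    val (_, c2, c3) = input
    (c2.toSeq, c3.toSeq, c2.toSeq, c3.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("repro").getOrCreate()
    import spark.implicits._

    val ds = Seq(
      (Seq(CaseClass1(1L)), Option(CaseClass2(1L, "a")), Option(CaseClass3(1L, 0.5)))
    ).toDS()

    // In our real job, the equivalent of this call fails with
    // org.apache.spark.SparkException: Task not serializable on 3.1.1
    ds.map(t => myFunction(t)).show()
    spark.stop()
  }
}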
The code did not change. We use KryoSerializer and the configuration is almost the same.
It used to work in Spark 2.4.5.
We've found a workaround that makes our jobs work. We wrap the function with the following:
def genMapper[A, B](f: A => B): A => B = {
  // MeatLocker serializes the wrapped function with Kryo and only exposes a
  // thin java.io.Serializable shell, so Spark's serializability check passes
  val locker = com.twitter.chill.MeatLocker(f)
  x => locker.get.apply(x)
}
We use it as follows:

DS.map(genMapper(myFunction)) // this works
// DS.map(myFunction)         // this fails with Task not serializable
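For anyone wondering why this helps: as we understand it, MeatLocker holds the value in a transient field and moves it through Java serialization as a Kryo-encoded byte array, so the function itself never has to be Java-serializable. Conceptually, something along these lines (a sketch of the idea, not chill's actual source):

import java.io.{ObjectInputStream, ObjectOutputStream}
import com.twitter.chill.{KryoPool, ScalaKryoInstantiator}

// Conceptual stand-in for com.twitter.chill.MeatLocker
class KryoBox[T](@transient private var value: T) extends Serializable {
  private def pool: KryoPool =
    KryoPool.withByteArrayOutputStream(1, new ScalaKryoInstantiator)

  // Java serialization hook: write the value as Kryo bytes
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    val bytes = pool.toBytesWithClass(value.asInstanceOf[AnyRef])
    out.writeInt(bytes.length)
    out.write(bytes)
  }

  // Java deserialization hook: rebuild the value from the Kryo bytes
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    val bytes = new Array[Byte](in.readInt())
    in.readFully(bytes)
    value = pool.fromBytes(bytes).asInstanceOf[T]
  }

  def get: T = value
}

The net effect is that Spark's ClosureCleaner only has to Java-serialize the small wrapper lambda returned by genMapper, while the real function travels as Kryo bytes inside it.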
Still, we really want to understand what is fundamentally different in serialization behavior between Spark 2.4.5 and Spark 3.x. Is it related to Scala 2.12 compiling closures as Java 8 lambdas rather than the anonymous inner classes Scala 2.11 emitted, which Spark's ClosureCleaner handles differently? We would appreciate a definitive explanation.
Thank you