
We are migrating our Spark jobs from 2.4.5 to 3.1.1 (Scala 2.11.11 to Scala 2.12.13).

We are facing the following exception:

org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2465) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:912) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:911) ~[spark-core_2.12-3.1.1.jar:3.1.1]
        at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:749) ~[spark-sql_2.12-3.1.1.jar:3.1.1]

This is triggered when we use map, flatMap, or cogroup (with certain functions) on a Dataset. For instance, here is the signature of a function we use to map one Dataset to another:

def myFunction(input: (Seq[caseClass1], Option[caseClass2], Option[caseClass3])): (Seq[caseClass2], Seq[caseClass3], Seq[caseClass2], Seq[caseClass3])

The code did not change. We use KryoSerializer and the configuration is almost the same.

It used to work in Spark 2.4.5.
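For context, the shape that triggers it for us is roughly the following (class and method names are illustrative, not our real code):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative sketch, not our real job: the mapped function is an
// instance method of a class that is not Serializable.
class MyJob(spark: SparkSession) /* not Serializable */ {
  import spark.implicits._

  def myFunction(x: Int): Int = x + 1

  def run(ds: Dataset[Int]): Dataset[Int] =
    // Eta-expansion makes this equivalent to `ds.map(x => this.myFunction(x))`,
    // so the task closure captures `this` — and with it the
    // non-serializable SparkSession.
    ds.map(myFunction)
}
```

Our suspicion is that this capture is what Spark 3 / Scala 2.12 can no longer clean away: in Scala 2.11, closures compiled to anonymous inner classes whose `$outer` fields Spark's ClosureCleaner could null out, whereas Scala 2.12 emits `invokedynamic` lambdas that the cleaner largely leaves as-is.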

We've found a workaround that makes our job work: we wrap the function with the following helper:

def genMapper[A, B](f: A => B): A => B = {
  val locker = com.twitter.chill.MeatLocker(f)
  x => locker.get.apply(x)
}
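Our (possibly simplified) understanding of the MeatLocker trick is that the wrapper itself is `java.io.Serializable` but stores the wrapped value as Kryo bytes, so Spark's Java serialization never has to serialize the function directly. A rough, untested sketch of the same idea, assuming chill's `KryoPool` API:

```scala
import com.twitter.chill.{KryoPool, ScalaKryoInstantiator}

// Sketch only: eagerly Kryo-serialize the payload; Java serialization
// then ships the bytes, not the (possibly non-Serializable) value.
class KryoBox[T](@transient private var value: T) extends Serializable {
  @transient private lazy val pool: KryoPool = ScalaKryoInstantiator.defaultPool
  private val bytes: Array[Byte] = pool.toBytesWithClass(value)

  def get: T = {
    if (value == null) value = pool.fromBytes(bytes).asInstanceOf[T]
    value
  }
}
```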

Used as follows:

DS.map(genMapper(myFunction)) // This works
// DS.map(myFunction) -- This does not work
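For what it's worth, another workaround that avoids the capture entirely (sketch, under the assumption that the root cause is an outer-instance capture) is to move the function into a standalone object:

```scala
// Methods on a top-level object have no outer instance, so the task
// closure only references the (trivially serializable) singleton.
object MyFunctions extends Serializable {
  def myFunction(x: Int): Int = x + 1
}

// ds.map(MyFunctions.myFunction)  // no enclosing `this` is captured
```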

Still, I really want to understand what is fundamentally different about closure serialization between Spark 2.4.5 and Spark 3.x.

Thank you

cnemri