We are migrating our Spark Scala jobs from AWS EMR (6.2.1, Spark 3.0.1) to the Databricks Lakehouse, and a few of our jobs are failing with a NullPointerException. When we lowered the Databricks Runtime to 7.3 LTS, the jobs run fine, since that runtime ships the same Spark version (3.0.1) as our EMR cluster. However, we need to migrate the application to a recent Databricks Runtime, version 11 or 12.
I went through the behavior changes between Spark 3.0 and 3.1 and set the following parameter:

spark.sql.legacy.castComplexTypesToString.enabled

to true. However, the jobs are still failing.
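
For reference, this is roughly how we set the flag in the jobs (a minimal sketch; in practice it is set in the cluster's Spark config, and since the flag only changes how complex types are cast to strings, it may well be unrelated to the NPE below):

    // Restores the Spark 3.0-style casting of structs/arrays/maps to strings.
    spark.conf.set("spark.sql.legacy.castComplexTypesToString.enabled", "true")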
The failure actually occurs while writing a staging DataFrame to an S3 location in Parquet format.
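
The write itself is unremarkable, roughly like this (the path and DataFrame name here are illustrative, not our actual code):

    // Illustrative staging write; the real DataFrame is produced by an
    // aggregation that, per the stack trace below, goes through a Scala UDF.
    stagingDf.write
      .mode("overwrite")
      .parquet("s3://our-bucket/staging/...")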
As mentioned above, the job runs fine on Databricks 7.3 LTS (Spark 3.0.1) and fails on Databricks 9.1 LTS (Spark 3.1.2). Please let us know if you have faced a similar issue or have any advice for us.
Below is the error message and stack trace:
Caused by: java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException: Null value appeared in non-nullable field:
- array element class: "scala.Boolean"
- root class: "scala.collection.mutable.WrappedArray"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
mapobjects(lambdavariable(MapObject, BooleanType, true, -1), assertnotnull(lambdavariable(MapObject, BooleanType, true, -1)), input[0, array<boolean>, true], Some(class scala.collection.mutable.WrappedArray))
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:201)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$scalaConverter$2(ScalaUDF.scala:162)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:757)
at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:102)
at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:100)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:890)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:890)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:819)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1657)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:822)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:678)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
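
The hint in the error message points at nullable types (scala.Option[_] or java.lang.Integer/java.lang.Boolean instead of the primitive Scala types), and the ScalaUDF frame suggests the decode fails on the input to one of our Scala UDFs over an array<boolean> column, where Spark 3.1 appears to be stricter about null elements than 3.0 was. A minimal sketch of the pattern we suspect and the fix the message suggests (all names and logic here are illustrative, not our actual code):

    import org.apache.spark.sql.functions.udf

    // Fails on Spark 3.1+ when the input array contains nulls: Seq[Boolean]
    // deserializes into primitive scala.Boolean elements, which cannot hold null.
    val allTrueStrict = udf((flags: Seq[Boolean]) => flags.forall(identity))

    // Nullable element type, as the error hint suggests: null elements survive
    // decoding and we decide explicitly how to treat them (here, null is false).
    val allTrueNullable = udf((flags: Seq[java.lang.Boolean]) =>
      flags.forall(f => f != null && f.booleanValue))

Alternatively, filtering null elements out of the array column before it reaches the UDF would avoid changing the UDF signatures; we would appreciate advice on which approach is recommended.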