When I try to execute `spark.sql` inside a UDF, I get a java.lang.NullPointerException. Is there any way I can do this?

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

// Define the UDF
def myUdf(spark: SparkSession): UserDefinedFunction = udf((col1: String, col2: String) => {
  // Execute the SQL query
  val result = spark.sql("SELECT 'Hello World!' AS text")

  // Return the result as a string (DataFrame.toString only renders the schema)
  result.toString()
})

// Use the UDF in a DataFrame transformation
def transform(df: DataFrame, col1: Column, col2: Column): DataFrame = {
  df.withColumn("result", myUdf(spark)(col1, col2))
}

val res = transform(df, col("salary"), col("gender"))
res.show()

The above code throws the exception below:

22/12/07 11:10:32 ERROR Executor: Exception in task 0.0 in stage 1679.0 (TID 19329)
org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$82b5b23cea489b2712a1db46c77e458$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$4802/591483562: (string, string) => string)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:154)
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:152)
        at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:616)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
        at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:616)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)

1 Answer

I am afraid this won't work.

From a technical standpoint, UDFs run on executors, while the Spark session can only be accessed, and used to schedule further work, on the driver. The null pointer exception you see is most likely the result of an attempt to reach parts of the Spark session that are not available on the executor.
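
For the toy example above, a minimal sketch of the pattern that does work: run the query once on the driver, capture the plain result, and let the UDF closure reference only serializable values (the function name `transformOnDriver` is mine, for illustration):

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

def transformOnDriver(spark: SparkSession, df: DataFrame, col1: Column, col2: Column): DataFrame = {
  // Runs on the driver, once, before the UDF is shipped to executors
  val text: String = spark.sql("SELECT 'Hello World!' AS text").head().getString(0)

  // The closure captures only a String, which serializes cleanly
  val greet = udf((a: String, b: String) => s"$a / $b / $text")
  df.withColumn("result", greet(col1, col2))
}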

From a semantic standpoint, if this were permitted, processing each row would create a new query, itself potentially processing lots of rows. Imagine a dataframe with 10M records: you would be creating 10M queries. That would not be feasible to implement.
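
As a sketch of the alternative suggested in the comments below: pre-join the reference data into the DataFrame, then let the UDF operate on plain column values. The table name `split_criteria`, the join key `gender`, and the `rule` column here are assumptions for illustration only:

import org.apache.spark.sql.functions.{broadcast, col, udf}
import org.apache.spark.sql.{DataFrame, SparkSession}

def withCriteria(spark: SparkSession, df: DataFrame): DataFrame = {
  // Hypothetical reference table; read once as part of the query plan,
  // not once per row
  val criteria = spark.table("split_criteria")

  // Per-row logic sees only plain values, never the SparkSession
  val applyRule = udf((salary: String, rule: String) =>
    if (salary != null && rule != null) s"$salary matched $rule" else "no match")

  df.join(broadcast(criteria), Seq("gender"), "left")  // assumed join key
    .withColumn("result", applyRule(col("salary"), col("rule")))
}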

Vladimir Prus
  • Thanks for the clarification @Vladimir Prus. Could you please suggest a way to tackle this scenario? https://stackoverflow.com/questions/74643348/how-to-execute-spark-sql-using-withcolumn-for-streaming-dataframe – Rahul Kumar Dec 08 '22 at 05:20
  • Sounds like you could first join with your additional table, and then the UDF can handle the complex logic but won't have to run SQL queries? – Vladimir Prus Dec 08 '22 at 11:46
  • I need to process queries written in `Split_Criteria`. Could you tell me how to do that? – Rahul Kumar Dec 08 '22 at 13:48