If i run the following code in spark ( 2.3.2.0-mapr-1901) , it runs fine on the first run.
SELECT count( `cpu-usage` ) as `cpu-usage-count` , sum( `cpu-usage` ) as `cpu-usage-sum` , percentile_approx( `cpu-usage`, 0.95 ) as `cpu-usage-approxPercentile`
FROM filtered_set
Where filtered_set
is a DataFrame that has been registered as a temp view using createOrReplaceTempView
.
I get a result and all is good on the first call. But...
If i then run this job again, ( note that this is a shared spark context, managed via apache livy), Spark throws:
Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: Undefined function: 'count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 10
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1216)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1216)
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1215)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1213)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
...
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
This problem occurs on the second run of the Livy job ( which is using the previous Spark session). It is not isolated just to the count function, (etc also happens with sum, etc) and any function appears to fail on the second run, regardless of what was called in the first run.
It seems like Spark's function registry is being cleared out ( including the default built in functions). We're not doing anything with the spark context.
Questions: - Is this expected or normal behaviour with spark? - How would I re-set or initialise the spark session so it doesn't lose all these functions?
I have seen Undefined function
errors described elsewhere in terms of user defined functions but never the built ins.