
I run the following Jupyter notebook query against the dataframe "Preds", a simplified DF of a prediction outcome:

A simple query against "label" succeeds, but the same query against "prediction" fails, and so does the more complex query below. I suspect the output field "prediction" from the two-class MLlib Linear Regression model may be causing problems through some type conversion:

(however, I can't think of why it would trigger a string-to-double conversion, unless it comes from the input side)

+-----+----------+
|label|prediction|
+-----+----------+
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
..................
..................

root
 |-- label: double (nullable = true)
 |-- prediction: double (nullable = true)


%%sql

SELECT
  CASE
    WHEN label = 1.0 AND prediction = 1.0 THEN 'True Positive'
    WHEN label = 0.0 AND prediction = 0.0 THEN 'True Negative'
    WHEN label = 0.0 AND prediction = 1.0 THEN 'False Positive'
    WHEN label = 1.0 AND prediction = 0.0 THEN 'False Negative'
    ELSE 'Unknown'
  END AS Cases
FROM Preds
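
As a sanity check, this is roughly the same CASE logic expressed through the DataFrame API instead of Spark SQL (a minimal sketch, assuming `Preds` is also available as a PySpark DataFrame object and not only as a registered temp view); if this version hits the same error, the failure is upstream of the query itself:

from pyspark.sql import functions as F

# DataFrame-API equivalent of the SQL CASE expression above.
cases = Preds.select(
    F.when((F.col("label") == 1.0) & (F.col("prediction") == 1.0), "True Positive")
     .when((F.col("label") == 0.0) & (F.col("prediction") == 0.0), "True Negative")
     .when((F.col("label") == 0.0) & (F.col("prediction") == 1.0), "False Positive")
     .when((F.col("label") == 1.0) & (F.col("prediction") == 0.0), "False Negative")
     .otherwise("Unknown")
     .alias("Cases")
)
cases.show()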

**It looks like the cause of the problem is: `Failed to execute user defined function($anonfun$4: (string) => double)`**

Lengthy error log:

 An error was encountered:
 An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
179.0 (TID 3748, wn0- 
abrshd.s2yinkedijvevogpqsbgf14b1h.hx.internal.cloudapp.net, executor 2): 
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$4: (string) => double)
at 
    org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.Dataset$$anonfun$56$$anon$1.hasNext(Dataset.scala:2712)
at org.apache.spark.sql.Dataset$$anonfun$56$$anon$1.next(Dataset.scala:2718)
at org.apache.spark.sql.Dataset$$anonfun$56$$anon$1.next(Dataset.scala:2711)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1963)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: org.apache.spark.SparkException: Unseen label: video store.
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
... 14 more
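
The root cause at the bottom of the trace (`Unseen label: video store`) is raised by a StringIndexerModel from the upstream ML pipeline, presumably because that transformation is only evaluated lazily when the SQL query actually runs. For reference, a minimal sketch of how a StringIndexer can be configured to tolerate categories it never saw at fit time (the column names here are purely illustrative, not taken from my actual pipeline):

from pyspark.ml.feature import StringIndexer

# Hypothetical indexer for an upstream categorical column.
# handleInvalid="skip" drops rows whose value was not seen when the
# indexer was fitted; newer Spark versions also support "keep", which
# maps unseen values to an extra index instead of failing.
indexer = StringIndexer(
    inputCol="category",
    outputCol="categoryIndex",
    handleInvalid="skip",
)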

I'd appreciate any tips or comments.

Comments:

  • See also [Spark, ML, StringIndexer: handling unseen labels](https://stackoverflow.com/q/34681534/9613318) – Alper t. Turker May 04 '18 at 09:50
  • My error occurs at the Spark SQL query stage, not while running the ML pipeline, so it is different from the others. – r poon Jun 06 '18 at 17:40
  • I suspect the "1.0"s in the SQL are strings, not doubles, whereas the Preds columns hold double values, so they somehow need to be converted to double? – r poon Jun 06 '18 at 17:55
