I was trying to get a binary classification report in PySpark and ran into this error:

'StructField' object has no attribute '_get_object_id'

Here is my code:

%%spark

from pyspark.mllib.evaluation import BinaryClassificationMetrics
#from pyspark.mllib.evaluation import BinaryClassificationMetrics
predictionAndLabels = test_pred.rdd.map(lambda Row : (float(Row['label']) , Row['prediction']))
metrics = BinaryClassificationMetrics(predictionAndLabels)

Also, based on the documentation (link), it apparently does not support the F1 measure, recall, etc. Any idea why, or how we can extract them without low-level coding?

Hamed Niakan

1 Answer

I don't think you have to go that deep. Taking the example data from the binary classification documentation you linked, and assuming a cutoff threshold of p = 0.5, you can count the true positives, false positives, and false negatives directly and compute the metrics from them:

# f1 = 2 * precision * recall / (precision + recall)
# precision = tp / (tp + fp)
# recall = tp / (tp + fn)

from pyspark.sql.functions import col

scoreAndLabels = sc.parallelize([(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
df = scoreAndLabels.toDF()

threshold = 0.5

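# _1 is the score, _2 is the true label
tp = df.where((col('_1') >= threshold) & (col('_2') == 1.0)).count()  # predicted positive, actually positive
fp = df.where((col('_1') >= threshold) & (col('_2') == 0.0)).count()  # predicted positive, actually negative
fn = df.where((col('_1') < threshold) & (col('_2') == 1.0)).count()   # predicted negative, actually positive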
precision = tp / (tp+fp)
recall = tp / (tp+fn)
f1 = 2 * (precision * recall) / (precision + recall)

This returns f1 = 0.75.
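
If you want to sanity-check the arithmetic without a Spark session, here is a minimal plain-Python sketch over the same (score, label) pairs. The helper name `binary_metrics` is made up for illustration, and returning 0.0 when a denominator is zero is my own assumption, not something Spark does for you. It uses the usual definitions: a false positive is a score at or above the threshold with label 0, and a false negative is a score below the threshold with label 1.

```python
# Same (score, label) pairs as the Spark example above.
score_and_labels = [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0),
                    (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)]


def binary_metrics(pairs, threshold=0.5):
    """Return (precision, recall, f1) for (score, label) pairs.

    Hypothetical helper: a score >= threshold counts as a positive prediction.
    """
    tp = sum(1 for score, label in pairs if score >= threshold and label == 1.0)
    fp = sum(1 for score, label in pairs if score >= threshold and label == 0.0)
    fn = sum(1 for score, label in pairs if score < threshold and label == 1.0)

    # Assumption: fall back to 0.0 instead of raising on an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = binary_metrics(score_and_labels)
print(precision, recall, f1)
```

On this data there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75, matching the Spark result.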

James Natale