I'm trying to perform cross-validation over a Random Forest in Spark 1.6.0 and I'm finding it hard to obtain the evaluation metrics (precision, recall, F1, ...). I want the average of these metrics over all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?
I have only found examples where the evaluation is performed afterwards on an independent test dataset, using the best model from the cross-validation. I don't plan to use separate train and test sets; I want to use the whole DataFrame (df) for the cross-validation, let it make the splits, and then take the average of the per-fold metrics.
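To make what I'm after concrete, here is a minimal hand-rolled sketch of it (assuming the pipeline and evaluator from the code further below, and using randomSplit for the folds, so the fold sizes are only approximate):

from functools import reduce
from pyspark.sql import DataFrame

# Hand-rolled 5-fold loop: fit on 4 folds, score the held-out fold, average.
folds = df.randomSplit([0.2] * 5, seed=42)
scores = []
for i in range(len(folds)):
    train = reduce(DataFrame.unionAll, [f for j, f in enumerate(folds) if j != i])
    fitted = pipeline.fit(train)
    scores.append(evaluator.evaluate(fitted.transform(folds[i])))
avg_metric = sum(scores) / len(scores)  # this average is what I want from CrossValidator

What I actually have with CrossValidator is the following: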
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# pipeline is my Pipeline ending in a RandomForestClassifier
paramGrid = ParamGridBuilder().build()  # empty grid: a single (default) parameter combination
evaluator = MulticlassClassificationEvaluator()  # metricName defaults to "f1"
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5)
model = crossval.fit(df)
evaluator.evaluate(model.transform(df))
For now, the last line of the code above, evaluator.evaluate(model.transform(df)), gives me a single metric for the best model, but it is computed on the same DataFrame the model was fitted on, so I'm not sure I'm doing this correctly.
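In case it's relevant, this is how I would read off the individual metrics from those same predictions (assuming the Spark 1.6 metricName values "f1", "precision" and "recall"):

predictions = model.transform(df)
for metric in ["f1", "precision", "recall"]:
    # evaluate() accepts a param map, so the same evaluator can compute each metric
    print("%s: %s" % (metric, evaluator.evaluate(predictions, {evaluator.metricName: metric})))

But again, these are computed on the full DataFrame, not averaged over the folds.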