
I would like to get the cross-validation's (internal) training accuracy, using PySpark and its ML library:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
param_grid = (ParamGridBuilder()
                     .addGrid(lr.regParam, [0.01, 0.5])
                     .addGrid(lr.maxIter, [5, 10])
                     .addGrid(lr.elasticNetParam, [0.01, 0.1])
                     .build())
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
cv = CrossValidator(estimator=lr, 
                    estimatorParamMaps=param_grid, 
                    evaluator=evaluator, 
                    numFolds=5)
model_cv = cv.fit(train)
predictions_lr = model_cv.transform(validation)
predictions = evaluator.evaluate(predictions_lr)

To get the accuracy metric for each cross-validation fold, I have tried:

print(model_cv.subModels)

but this attribute is empty (None).

How can I get the accuracy of each fold?

Simone

1 Answer


I know this is old, but in case someone is looking: for Spark to keep the non-best models trained during cross-validation, you need to enable collection of sub-models when creating the CrossValidator. Just set collectSubModels to True (it is False by default).

i.e.

CrossValidator(estimator=lr, 
               estimatorParamMaps=param_grid, 
               evaluator=evaluator, 
               numFolds=5,
               collectSubModels=True) 
mathfish