Similar to "Is it possible to access estimator attributes in spark.ml pipelines?", I want to access an estimator inside a pipeline, e.g. the last stage.
The approach mentioned there no longer seems to work in Spark 2.0.1. How does it work now?
Edit
Perhaps I should explain this in a little more detail. Here is my estimator together with the vector assembler:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// Package name assumed from the xgboost4j-spark release that ships XGBoostEstimator
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostEstimator}

val numRound = 20
val numWorkers = 4
val xgbBaseParams = Map(
  "max_depth" -> 10,
  "eta" -> 0.1,
  "seed" -> 50,
  "silent" -> 1,
  "objective" -> "binary:logistic"
)
val xgbEstimator = new XGBoostEstimator(xgbBaseParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
// Assemble every column except the label into a single features vector
val vectorAssembler = new VectorAssembler()
  .setInputCols(train.columns
    .filter(!_.contains("label")))
  .setOutputCol("features")
val simplePipeParams = new ParamGridBuilder()
  .addGrid(xgbEstimator.round, Array(numRound))
  .addGrid(xgbEstimator.nWorkers, Array(numWorkers))
  .build()
val simplPipe = new Pipeline()
  .setStages(Array(vectorAssembler, xgbEstimator))
val numberOfFolds = 2
val cv = new CrossValidator()
  .setEstimator(simplPipe)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("prediction"))
  .setEstimatorParamMaps(simplePipeParams)
  .setNumFolds(numberOfFolds)
  .setSeed(gSeed)
val cvModel = cv.fit(train)
val trainPerformance = cvModel.transform(train)
val testPerformance = cvModel.transform(test)
Now I want to perform custom scoring, e.g. with a cut-off point other than 0.5. This is possible if I get hold of the model:
val realModel = cvModel.bestModel.asInstanceOf[XGBoostClassificationModel]
but this step here does not compile. Thanks to your suggestion I can obtain the model:
val pipelineModel: Option[PipelineModel] = cvModel.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}
val realModel: Option[XGBoostClassificationModel] = pipelineModel
  .flatMap {
    _.stages.collect { case t: XGBoostClassificationModel => t }
      .headOption
  }
// TODO write it nicer
val measureResults = realModel.map { rm =>
  for (
    thresholds <- Array(Array(0.2, 0.8), Array(0.3, 0.7), Array(0.4, 0.6),
      Array(0.6, 0.4), Array(0.7, 0.3), Array(0.8, 0.2))
  ) {
    rm.setThresholds(thresholds)
    val predResult = rm.transform(test)
      .select("label", "probabilities", "prediction")
      .as[LabelledEvaluation]
    // println takes a single argument, and arrays need explicit formatting
    println(s"cutoff was ${thresholds.mkString("[", ", ", "]")}")
    calculateEvaluation(R, predResult)
  }
}
However, the problem is that the call val predResult = rm.transform(test) will fail, because test (like train) does not contain the features column produced by the vectorAssembler. This column is only created when the full pipeline is run.
So I decided to create a second pipeline:
val scoringPipe = new Pipeline()
  .setStages(Array(vectorAssembler, rm))
val predResult = scoringPipe.fit(train).transform(test)
but that seems to be a bit clumsy. Do you have a better / nicer idea?
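To make clear what I expect the thresholds arrays above to do, here is a minimal plain-Scala sketch with no Spark involved. It assumes the model follows the rule documented for Spark ML's probabilistic classifiers, where the predicted class is the one with the largest p / t (p being the class probability and t the class threshold). I have not verified that XGBoostClassificationModel behaves exactly this way. The object ThresholdSketch and the helper predictWithThresholds are made up for illustration and are not part of Spark or XGBoost4J.

```scala
// Plain-Scala sketch, no Spark required. `predictWithThresholds` is a
// hypothetical helper, not an API of Spark or XGBoost4J.
object ThresholdSketch {

  // Returns the index of the class with the largest probability / threshold ratio.
  def predictWithThresholds(probabilities: Array[Double], thresholds: Array[Double]): Int = {
    require(probabilities.length == thresholds.length, "one threshold per class")
    require(thresholds.forall(_ > 0.0), "this sketch assumes strictly positive thresholds")
    val scaled = probabilities.zip(thresholds).map { case (p, t) => p / t }
    // Ties resolve to the lowest class index here, which is a simplification.
    scaled.indexOf(scaled.max)
  }

  def main(args: Array[String]): Unit = {
    val probs = Array(0.7, 0.3) // P(class 0), P(class 1)
    for (thresholds <- Seq(Array(0.5, 0.5), Array(0.8, 0.2))) {
      val prediction = predictWithThresholds(probs, thresholds)
      println(s"thresholds = ${thresholds.mkString("[", ", ", "]")} -> class $prediction")
    }
  }
}
```

Under that assumption, and for two classes whose thresholds sum to 1, Array(0.8, 0.2) amounts to predicting class 1 whenever P(class 1) > 0.2, which is the kind of cut-off other than 0.5 that I am after.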