
Similar to *Is it possible to access estimator attributes in spark.ml pipelines?*, I want to access the estimator, e.g. the last element in the pipeline.

The approach mentioned there no longer seems to work for Spark 2.0.1. How does it work now?

Edit

Perhaps I should explain it in a bit more detail. Here is my estimator plus vector assembler:

val numRound = 20
val numWorkers = 4
val xgbBaseParams = Map(
    "max_depth" -> 10,
    "eta" -> 0.1,
    "seed" -> 50,
    "silent" -> 1,
    "objective" -> "binary:logistic"
  )

val xgbEstimator = new XGBoostEstimator(xgbBaseParams)
    .setFeaturesCol("features")
    .setLabelCol("label")

val vectorAssembler = new VectorAssembler()
    .setInputCols(train.columns
      .filter(!_.contains("label")))
    .setOutputCol("features")

val simplePipeParams = new ParamGridBuilder()
  .addGrid(xgbEstimator.round, Array(numRound))
  .addGrid(xgbEstimator.nWorkers, Array(numWorkers))
  .build()

val simplPipe = new Pipeline()
  .setStages(Array(vectorAssembler, xgbEstimator))

val numberOfFolds = 2
val cv = new CrossValidator()
  .setEstimator(simplPipe)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("prediction"))
  .setEstimatorParamMaps(simplePipeParams)
  .setNumFolds(numberOfFolds)
  .setSeed(gSeed)

val cvModel = cv.fit(train)
val trainPerformance = cvModel.transform(train)
val testPerformance = cvModel.transform(test)

Now I want to perform custom scoring, e.g. with a cut-off point other than 0.5. This is possible if I can get hold of the model:

val realModel = cvModel.bestModel.asInstanceOf[XGBoostClassificationModel]

but this step does not compile. Thanks to your suggestion, I can obtain the model:

val pipelineModel: Option[PipelineModel] = cvModel.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}

val realModel: Option[XGBoostClassificationModel] = pipelineModel
  .flatMap {
    _.stages.collect { case t: XGBoostClassificationModel => t }
      .headOption
  }

// TODO write it nicer
val measureResults = realModel.map { rm =>
  for (
    thresholds <- Array(Array(0.2, 0.8), Array(0.3, 0.7), Array(0.4, 0.6),
      Array(0.6, 0.4), Array(0.7, 0.3), Array(0.8, 0.2))
  ) {
    rm.setThresholds(thresholds)

    val predResult = rm.transform(test)
      .select("label", "probabilities", "prediction")
      .as[LabelledEvaluation]
    println(s"cutoff was ${thresholds.mkString("(", ", ", ")")}")
    calculateEvaluation(R, predResult)
  }
}
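As an aside on what `setThresholds` actually does: to my understanding, Spark's `ProbabilisticClassificationModel` predicts the class maximizing `probability(i) / threshold(i)`. A minimal sketch in plain Scala (no Spark required; the function name is mine, for illustration):

```scala
// Sketch of the thresholded prediction rule used by Spark's
// ProbabilisticClassificationModel: pick the class with the largest
// probability-to-threshold ratio. A lower threshold for a class makes
// that class easier to predict.
def predictWithThresholds(probabilities: Array[Double],
                          thresholds: Array[Double]): Int = {
  require(probabilities.length == thresholds.length,
    "one threshold per class is required")
  val scaled = probabilities.zip(thresholds).map { case (p, t) => p / t }
  scaled.indexOf(scaled.max)
}
```

So with probabilities `(0.55, 0.45)`, thresholds `(0.2, 0.8)` favour class 0, while `(0.8, 0.2)` flip the prediction to class 1.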

However, the problem is that

val predResult = rm.transform(test)

will fail, because test does not contain the features column produced by the vectorAssembler. That column is only created when the full pipeline is run.

So I decided to create a second pipeline:

val scoringPipe = new Pipeline()
  .setStages(Array(vectorAssembler, rm))
val predResult = scoringPipe.fit(train).transform(test)

but that seems a bit clumsy. Do you have a better or nicer idea?
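One alternative worth noting: since the fitted `PipelineModel` holds references to its fitted stage transformers, mutating the extracted model stage (e.g. via `setThresholds`) should be visible when the fitted pipeline itself is reused for `transform`, so a second pipeline and refit may be unnecessary. A sketch of that reference-sharing idea with hypothetical stand-in classes (not the Spark API):

```scala
// Stand-in types (hypothetical, for illustration only) showing that a fitted
// pipeline and a stage extracted from it share the same object: mutate the
// stage, then transform with the whole pipeline.
trait Transformer { def transform(data: Seq[Double]): Seq[Double] }

class ThresholdModel(var threshold: Double) extends Transformer {
  def setThreshold(t: Double): this.type = { threshold = t; this }
  // Classify each probability against the current cut-off.
  def transform(data: Seq[Double]): Seq[Double] =
    data.map(p => if (p >= threshold) 1.0 else 0.0)
}

class FittedPipeline(val stages: Array[Transformer]) {
  // Feed the data through every stage in order, like PipelineModel.transform.
  def transform(data: Seq[Double]): Seq[Double] =
    stages.foldLeft(data)((d, stage) => stage.transform(d))
}

val model = new ThresholdModel(0.5)
val pipeline = new FittedPipeline(Array(model))

// Extract the stage (same reference), change its cut-off, reuse the pipeline:
val extracted = pipeline.stages.collect { case t: ThresholdModel => t }.head
extracted.setThreshold(0.7)
```

Here `pipeline.transform(Seq(0.6))` now yields `0.0`, because the shared stage's cut-off moved to 0.7. In the same spirit, one could set thresholds on the extracted model and call transform on the fitted `PipelineModel` directly, so its assembler stage creates the features column first.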

Georg Heiler
  • I believe what you are looking for is `pipeline.getStages()` which returns all the stages in the form of an array. You can then access any stage you want. More information in [Documentation](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline). – ShirishT Nov 11 '16 at 23:21
  • Possible duplicate of [how to obtain the trained best model from a crossvalidator](http://stackoverflow.com/questions/36347875/how-to-obtain-the-trained-best-model-from-a-crossvalidator) –  Nov 12 '16 at 00:46

1 Answer


Nothing changed in Spark 2.0.0 and the same approach still works. Example pipeline:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

And to extract the fitted model:

val logRegModel = model.stages.last
  .asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel]
  • My problem is that I want to use the pipeline in a cross validation, i.e. the estimator is nested twice. A `cvModel.bestModel.getStages` does not work. So how would I get the pipeline of a CrossValidator then? – Georg Heiler Nov 12 '16 at 00:03
  • Then it is a duplicate of http://stackoverflow.com/questions/36347875/how-to-obtain-the-trained-best-model-from-a-crossvalidator. –  Nov 12 '16 at 00:46