I am training a Random Forest model in Spark 2.3 using a StringIndexer, OneHotEncoderEstimator and a RandomForestRegressor. Like this:

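import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Indexer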
val stringIndexers = categoricalColumns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Idx")
    .setHandleInvalid("keep")
    .fit(training)
}

//HotEncoder
val encoders = featuresEnconding.map { colName =>
  new OneHotEncoderEstimator()
    .setInputCols(Array(colName + "Idx"))
    .setOutputCols(Array(colName + "Enc"))
    .setHandleInvalid("keep")
}  

// Adding features into a feature vector column
val assembler = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("features")

val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(1000)


val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)

val pipelineRF = new Pipeline().setStages(stepsRF)

val paramGridRF = new ParamGridBuilder()
  .addGrid(rf.minInstancesPerNode, Array(1, 5, 15))
  .addGrid(rf.maxDepth, Array(10, 11, 12))
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .build()


// Defining the evaluator
val evaluatorRF = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

// Using cross validation to train the model
val cvRF = new CrossValidator()
  .setEstimator(pipelineRF)
  .setEvaluator(evaluatorRF)
  .setEstimatorParamMaps(paramGridRF)
  .setNumFolds(10)
  .setParallelism(3)

//Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)

I am not sure what the best combination of parameters is for this model, so I added the following parameter grid:

.addGrid(rf.minInstancesPerNode, Array(1, 5, 15))
.addGrid(rf.maxDepth, Array(10, 11, 12))
.addGrid(rf.numTrees, Array(20, 50, 100))

I then let the CrossValidator find the best combination. Now I would like to find out which combination it picked, so I can keep tuning the model from there. I tried to get the parameters like this:

cvRFModel.bestModel.extractParamMap

But I am getting an empty map:

org.apache.spark.ml.param.ParamMap =
{

}

What am I missing?

Ignacio Alorre

1 Answer

Based on the following question I tried this, but I am not sure if it is the correct approach:

val avgMetricsParamGrid = cvRFModel.avgMetrics

val combined = paramGridRF.zip(avgMetricsParamGrid)

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]


val parms = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel].explainParams

And it gave me information about several parameters, like this:

labelCol: label column name (default: label, current: label)
maxBins: Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature. (default: 32, current: 1000)
maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 5, current: 12)
maxMemoryInMB: Maximum memory in MB allocated to histogram aggregation. (default: 256)
minInfoGain: Minimum information gain for a split to be considered at a tree node. (default: 0.0)
minInstancesPerNode: Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1. (default: 1, current: 1)
numTrees: Number of trees to train (>= 1) (default: 20, current: 20)
predictionCol: prediction column name (default: prediction)
seed: random seed (default: 235498149)
subsamplingRate: Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0)

What I am still not sure about is which stage I need to select. I chose the last one since the training process is iterative, but I am not 100% sure this is correct. Any feedback is appreciated.
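On the stage question: as far as I understand, a PipelineModel keeps its stages in the same order as the Pipeline they came from, so the last stage is the fitted random forest simply because rf was the last step in stepsRF. To cross-check which grid point won, the grid can be paired with the average metrics and the best pair picked by hand. Below is a minimal sketch of that pairing in plain Scala. BestGridPoint and best are hypothetical names I made up (they are not part of Spark), and the values in main are invented for illustration only.

```scala
// Hypothetical helper, not part of Spark: pairs each candidate with its
// average cross-validation metric and returns the best pair.
object BestGridPoint {

  def best[P](candidates: Seq[P],
              avgMetrics: Seq[Double],
              largerIsBetter: Boolean): (P, Double) = {
    require(candidates.nonEmpty, "need at least one candidate")
    require(candidates.length == avgMetrics.length,
      "each candidate needs exactly one metric")

    // Same idea as paramGridRF.zip(avgMetricsParamGrid) in the answer
    val paired = candidates.zip(avgMetrics)

    // For an error metric such as RMSE the smallest value wins,
    // for a score such as r2 the largest one does.
    if (largerIsBetter) paired.maxBy(_._2) else paired.minBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Made-up grid labels and metric values, for illustration only
    val grid = Seq("maxDepth=10", "maxDepth=11", "maxDepth=12")
    val metrics = Seq(4.2, 3.9, 3.7)

    val (params, metric) = best(grid, metrics, largerIsBetter = false)
    println(s"best: $params with average metric $metric")
  }
}
```

With the fitted model from the question, I would expect the call to look like BestGridPoint.best(cvRFModel.getEstimatorParamMaps.toSeq, cvRFModel.avgMetrics.toSeq, evaluatorRF.isLargerBetter). Taking the direction from the evaluator matters because RegressionEvaluator uses rmse by default, where a smaller value is better.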

Ignacio Alorre