I am using Spark 1.5.1 with MLlib. I built a random forest model using MLlib and am now using the model to make predictions. I can find the predicted category (0.0 or 1.0) using the .predict function. However, I can't find a function to retrieve the probability (see the attached screenshot). I thought Spark 1.5.1's random forest would provide the probability; am I missing something here?
2 Answers
Unfortunately the feature is not available in the older Spark MLlib 1.5.1.
You can however find it in the more recent Pipeline API (spark.ml) as RandomForestClassifier:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file, converting it to a DataFrame.
// (.toDF needs the SQLContext implicits in scope; spark-shell imports them
// automatically, otherwise add: import sqlContext.implicits._)
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF

// Index labels, adding metadata to the label column.
// Fit on the whole dataset to include all labels in the index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol(labelIndexer.getOutputCol)
  .setFeaturesCol(featureIndexer.getOutputCol)
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain the indexers and the forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Fit the model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)
// predictions: org.apache.spark.sql.DataFrame = [label: double, features: vector, indexedLabel: double, indexedFeatures: vector, rawPrediction: vector, probability: vector, prediction: double, predictedLabel: string]
predictions.show(10)
// +-----+--------------------+------------+--------------------+-------------+-----------+----------+--------------+
// |label| features|indexedLabel| indexedFeatures|rawPrediction|probability|prediction|predictedLabel|
// +-----+--------------------+------------+--------------------+-------------+-----------+----------+--------------+
// | 0.0|(692,[124,125,126...| 1.0|(692,[124,125,126...| [0.0,10.0]| [0.0,1.0]| 1.0| 0.0|
// | 0.0|(692,[124,125,126...| 1.0|(692,[124,125,126...| [1.0,9.0]| [0.1,0.9]| 1.0| 0.0|
// | 0.0|(692,[129,130,131...| 1.0|(692,[129,130,131...| [1.0,9.0]| [0.1,0.9]| 1.0| 0.0|
// | 0.0|(692,[154,155,156...| 1.0|(692,[154,155,156...| [1.0,9.0]| [0.1,0.9]| 1.0| 0.0|
// | 0.0|(692,[154,155,156...| 1.0|(692,[154,155,156...| [1.0,9.0]| [0.1,0.9]| 1.0| 0.0|
// | 0.0|(692,[181,182,183...| 1.0|(692,[181,182,183...| [1.0,9.0]| [0.1,0.9]| 1.0| 0.0|
// | 1.0|(692,[99,100,101,...| 0.0|(692,[99,100,101,...| [4.0,6.0]| [0.4,0.6]| 1.0| 0.0|
// | 1.0|(692,[123,124,125...| 0.0|(692,[123,124,125...| [10.0,0.0]| [1.0,0.0]| 0.0| 1.0|
// | 1.0|(692,[124,125,126...| 0.0|(692,[124,125,126...| [10.0,0.0]| [1.0,0.0]| 0.0| 1.0|
// | 1.0|(692,[125,126,127...| 0.0|(692,[125,126,127...| [10.0,0.0]| [1.0,0.0]| 0.0| 1.0|
// +-----+--------------------+------------+--------------------+-------------+-----------+----------+--------------+
// only showing top 10 rows
Note: This example is taken from the Random forest classifier section of the official Spark ML documentation.
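If you are on Spark 2.x, the same dataset can also be loaded directly as a DataFrame, without going through MLUtils (this is the form used in the 2.x version of that documentation):

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")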
And here is some explanation of some of the output columns:

- predictionCol: the predicted label.
- rawPredictionCol: a Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction (available for classification only).
- probabilityCol: a Vector of length # classes, equal to rawPrediction normalized to a multinomial distribution (available for classification only).
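If you need the probability of one specific class as a plain double column, a minimal sketch is to extract it from the probability vector with a UDF. The column name probPositive and the index 1 below are assumptions (the index depends on how StringIndexer mapped your labels), and the vector type shown is the Spark 2.x one; on 1.x use org.apache.spark.mllib.linalg.Vector instead:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Index 1 assumes the class of interest was mapped to indexed label 1.0.
val positiveProb = udf((v: Vector) => v(1))

predictions
  .withColumn("probPositive", positiveProb(col("probability")))
  .select("predictedLabel", "probability", "probPositive")
  .show(5)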

-
I see ... why does it want to make ml-randomForest and mllib-randomForest? What's the difference between these two libraries? Why not just combine them into one? – Edamame Oct 28 '15 at 21:59
-
The spark.ml package aims to provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines, whereas the original MLlib library deals with RDDs. Building algorithms over DataFrames is conceptually very different from traditional map-reduce operations on RDDs. The unification of the two libraries is not as trivial as it sounds. – eliasah Oct 28 '15 at 22:05
-
My training data is pretty big and stored in an RDD. Is there a way I can train the ml-randomForest with an RDD? Or is there any way I can retrieve the probability using mllib-randomForest? Thanks! – Edamame Oct 28 '15 at 22:20
-
You'll need to transform it into a DataFrame; there is no other way around it. That won't be very expensive though, even though type conversions are still not optimized. With the Tungsten project in Spark, operations made on DataFrames are optimized, resulting in better performance time-wise. Meaning that the time you might lose with the conversion is gained at the computation level when you apply your algorithm (see the sketch after this thread). – eliasah Oct 28 '15 at 22:26
-
Is the DataFrame an RDD? Is there a constraint on the size of the training data? How do we know if my training data could fit into a DataFrame? Thanks! – Edamame Oct 28 '15 at 22:30
-
DataFrame is a structural abstraction over RDDs. It's not an RDD. Conceptually it's the same as in R/Pandas, but it's distributed, since its internal structure is actually an RDD. In other terms, if it fits in your RDD, it will fit into your DataFrame. – eliasah Oct 28 '15 at 22:35
-
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/93660/discussion-between-eliasah-and-edamame). – eliasah Oct 29 '15 at 09:51
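Following up on the conversion discussed in the thread above, here is a minimal sketch, assuming a Spark 1.5-style SQLContext and a hypothetical trainingRdd of type RDD[LabeledPoint] (neither name appears in the original discussion):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// `sc` is the existing SparkContext; `trainingRdd: RDD[LabeledPoint]`
// is assumed to hold your training data.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// LabeledPoint is a case class, so the implicit conversion yields
// a DataFrame with columns "label" and "features".
val trainingDf = trainingRdd.toDF()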
You can't directly get the classification probabilities, but it is relatively easy to calculate them yourself. RandomForest is an ensemble of trees, and the output probability for a class is the fraction of trees that vote for it, i.e. the vote count divided by the total number of trees.
Since the RandomForestModel in MLlib gives you the trained trees, it is easy to do this yourself. The following code gives the probability for the binary classification problem. Its generalization to multi-class classification is straightforward.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

def predict(points: RDD[LabeledPoint], model: RandomForestModel) = {
  val numTrees = model.trees.length
  // Broadcast the trees so each executor gets a single copy.
  val trees = points.sparkContext.broadcast(model.trees)
  points.map { point =>
    // Fraction of trees voting 1.0 = probability of the positive class.
    trees.value
      .map(_.predict(point.features))
      .sum / numTrees
  }
}
For the multi-class case you only need to replace the map with .map(_.predict(point.features) -> 1.0), group by key instead of summing, and finally take the class with the maximum count of votes, as in the sketch below.
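Here is a minimal sketch of that multi-class variant for a single local feature vector (the names classProbabilities and features are assumptions, not part of the original answer):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Hypothetical helper: each tree casts one vote; normalizing the vote
// counts by the number of trees yields a per-class probability estimate.
def classProbabilities(features: Vector, model: RandomForestModel): Map[Double, Double] = {
  val numTrees = model.trees.length
  model.trees
    .map(_.predict(features))   // one predicted class per tree
    .groupBy(identity)          // class -> all votes for that class
    .map { case (cls, votes) => cls -> votes.length.toDouble / numTrees }
}

The predicted class is then the key with the highest probability, e.g. classProbabilities(features, model).maxBy(_._2)._1.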

-
Thanks TNM! But in my use case, I am applying the predict function to a data point, not an RDD, i.e. I have modelObject.predict(myPoint) where myPoint is of type org.apache.spark.mllib.linalg.Vector. Can I still compute the probability in such a case? Thanks! – Edamame Dec 18 '15 at 20:35
-
Yes you can. Just replace the input type of the function with point: Vector, remove the points.map part, and instead just start with trees.map{... – TNM Dec 18 '15 at 21:09
-
Thanks TNM! When I tried to do val trees = point.sparkContext.broadcast(modelObject.trees), where point is of type org.apache.spark.mllib.linalg.Vector, I got an error saying: value sparkContext is not a member of org.apache.spark.mllib.linalg.Vector. Am I missing anything? Is there a way to fix this? Thanks! – Edamame Dec 18 '15 at 21:26
-
Seems to work if I do val prob = modelObject.trees.map(_.predict(point)).sum / modelObject.trees.length. Is this correct? Thanks! – Edamame Dec 18 '15 at 21:30
-
It is correct. Using broadcast in the original answer was only an optimization. – TNM Dec 19 '15 at 02:39
-
This is missing two things: 1) it doesn't factor in the weights for the different trees (the ensemble classification is weighted by the out-of-bag error, so the probability prediction should be too); 2) decision trees actually give a probability directly at the leaves (the fraction of positive cases at the leaf), but unfortunately there doesn't appear to be an easy way to get this out of the Spark model. – Brian Oct 18 '16 at 13:30
-
I've never heard of a random forest ensemble strategy that weights each tree based on out-of-bag error. Perhaps because it breaks the theory, and the output model is then no longer guaranteed to be immune against overfitting; but to be honest it sounds like an interesting variation of the original algorithm. I also looked into the Spark code, and the trees weren't weighted there either. As for your second point, you can get the probability out of an ML decision tree by reading probabilityCol. – TNM Oct 24 '16 at 15:22