
I'm using RandomForest.featureImportances but I don't understand the output result.

I have 12 features, and this is the output I get.

I get that this might not be an apache-spark specific question but I cannot find anywhere that explains the output.

// org.apache.spark.mllib.linalg.Vector = (12,[0,1,2,3,4,5,6,7,8,9,10,11],
 [0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0.06437052114945474,0.1601713590349946,0.0324327322375338,0.057751258970832206])

2 Answers


Given a tree ensemble model, RandomForest.featureImportances computes the importance of each feature.

This generalizes the idea of "Gini" importance to other losses, following the explanation of Gini importance from "Random Forests" documentation by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.

For collections of trees, which include boosting and bagging, Hastie et al. suggest using the average of single-tree importances across all trees in the ensemble.

This feature importance is calculated as follows (a rough sketch of the computation follows the references below):

  • Average over trees:
    • importance(feature j) = sum (over nodes which split on feature j) of the gain, where the gain is scaled by the number of instances passing through the node
    • Normalize the importances for the tree to sum to 1.
  • Normalize feature importance vector to sum to 1.

References: Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001, section 15.3.2 "Variable Importance", page 593.
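As a rough, self-contained sketch of that averaging (this is not Spark's actual source; the names perTreeImportances and numFeatures are made up for illustration), assuming each per-tree importance vector already sums to 1:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sum the per-tree importance vectors element-wise, then normalize the
// result so the ensemble importance vector sums to 1.
def ensembleImportances(perTreeImportances: Seq[Array[Double]], numFeatures: Int): Vector = {
  val summed = perTreeImportances.foldLeft(Array.fill(numFeatures)(0.0)) { (acc, tree) =>
    acc.zip(tree).map { case (a, b) => a + b }
  }
  val total = summed.sum
  Vectors.dense(summed.map(_ / total))
}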

Let's go back to your importance vector:

val importanceVector = Vectors.sparse(12,Array(0,1,2,3,4,5,6,7,8,9,10,11), Array(0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0.06437052114945474,0.1601713590349946,0.0324327322375338,0.057751258970832206))

First, let's sort these features by importance:

importanceVector.toArray.zipWithIndex
            .map(_.swap)
            .sortBy(-_._2)
            .foreach(x => println(x._1 + " -> " + x._2))
// 0 -> 0.1956128039688559
// 9 -> 0.1601713590349946
// 2 -> 0.11302128590305296
// 3 -> 0.091986700351889
// 6 -> 0.06929766152519388
// 1 -> 0.06863606797951556
// 8 -> 0.06437052114945474
// 5 -> 0.05975817050022879
// 11 -> 0.057751258970832206
// 7 -> 0.052654922125615934
// 4 -> 0.03430651625283274
// 10 -> 0.0324327322375338

So what does this mean?

It means that your first feature (index 0) is the most important feature with a weight of ~ 0.19 and your 11th (index 10) feature is the least important in your model.
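If you only want the indices of the most important features (say the top 3), a small extension of the snippet above would be something like:

// Keep the indices of the 3 most important features (building on importanceVector above)
val top3 = importanceVector.toArray.zipWithIndex
  .sortBy { case (importance, _) => -importance }
  .take(3)
  .map { case (_, index) => index }
// top3: Array(0, 9, 2)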

eliasah
  • Great, detailed answer, thank you! I'm doing multiclass classification - 4 classes, would there be a way to compute feature importance for each class? – other15 Jun 17 '16 at 12:51
  • It doesn't seem like it for now. – eliasah Jun 17 '16 at 12:56
  • @other15, my first thought would be to train a binary classifier for each of your 4 classes. Then you would have feature importances for each. Not ideal, I know, but it should work. – Zak Kann Jul 01 '16 at 15:53
  • That's not a very good approach. You are considering a strong heuristic saying that the classification is correct for each classifier which might not be the case. If the data is unbalanced your results will be meaningless. – eliasah Jul 01 '16 at 16:04
  • Would a similar assumption not be present in the single-class feature importance (were it available) for results from the multiclass classifier? – Zak Kann Jul 01 '16 at 16:39
  • The issue is that we won't be able to evaluate that assumption in a one-vs-all approach. I'm sorry, I don't have references at my hand now. I'll add it as a comment later (I'm on my mobile phone) – eliasah Jul 01 '16 at 16:41
  • is there a threshold at which you should not use a feature? e.g if a feature is below 0.05 then it shouldn't be used? – other15 Jul 07 '16 at 11:10
  • I don't like to use hard thresholds in these kinds of scenarios. Nevertheless, features 0, 9 and 2 are obviously more important to the model than 4 and 10. You can try to fit your model without 4 and 10 and check again. Or maybe compare with some other model. This is quite broad to answer like that without domain knowledge. – eliasah Jul 07 '16 at 11:43
  • @eliasah, have you had a chance to find the aforementioned references? – Zak Kann Jul 11 '16 at 14:25
  • I didn't look for it sorry. But I believe it's in Bishop's Pattern Recognition and Machine Learning book. The chapter concerning multiclass classification and the one-versus-all approach with support vector machine. – eliasah Jul 11 '16 at 14:37
  • Cool. I'll start my search there. Thanks. – Zak Kann Jul 11 '16 at 14:50
  • perfect description ...Thanks @eliasah – Sahil Desai Jan 29 '18 at 08:30
  • @eliasah is there any threshold value for selecting important features ? – Sahil Desai Jan 29 '18 at 08:34
  • Let's say there are no rules for that. As a good practice, you can always evaluate your model using the most important features first and go from there. – eliasah Jan 29 '18 at 08:39
  • How can I do this in pyspark? – mah65 Sep 30 '20 at 15:07

Adding on to the previous answer:

One of the problems I faced was dumping the result in the form of (featureName, importance) as a CSV. You can get the metadata for the input vector of features as

 val featureMetadata = predictions.schema("features").metadata

This is the JSON structure of this metadata:

{
  "ml_attr": {
    "attrs": {
      "numeric": [{"idx": I, "name": N}, ...],
      "nominal": [{"vals": V, "idx": I, "name": N}, ...]
    },
    "num_attrs": #Attr
  }
}

Code for extracting the importance:

import org.apache.spark.sql.types.Metadata
import org.apache.spark.ml.classification.RandomForestClassificationModel

val attrs = featureMetadata.getMetadata("ml_attr").getMetadata("attrs")
val f: Metadata => (Long, String) = m => (m.getLong("idx"), m.getString("name"))
val nominalFeatures = attrs.getMetadataArray("nominal").map(f)
val numericFeatures = attrs.getMetadataArray("numeric").map(f)
val features = (numericFeatures ++ nominalFeatures).sortBy(_._1)

// Pair each feature name with its importance from the fitted random forest stage
val fImportance = pipeline.stages.filter(_.uid.startsWith("rfc")).head
  .asInstanceOf[RandomForestClassificationModel]
  .featureImportances.toArray.zip(features)
  .map { case (imp, (_, name)) => (name, imp) }.sortBy(-_._2)

// Save it now (one partition, one "name,importance" line per feature)
sc.parallelize(fImportance.toSeq, 1).map(x => s"${x._1},${x._2}").saveAsTextFile(fPath)
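
If you prefer the DataFrame writer over saveAsTextFile, a sketch of the same dump (assuming a SparkSession named spark is in scope) could look like:

import spark.implicits._

// Write the (feature, importance) pairs as a single CSV file with a header
fImportance.toSeq.toDF("feature", "importance")
  .coalesce(1)
  .write.option("header", "true")
  .csv(fPath)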
sourabh