
This is a question directly related to the answer provided here: MLR random forest multi label get feature importance

To summarize, the question is about producing a variable importance plot for a multi-label classification problem. I am copying the code provided by another person to produce the vimp plot:

library(mlr)
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
lrn.rfsrc = makeLearner("multilabel.randomForestSRC")
mod2 = train(lrn.rfsrc, yeast.task)

vi = randomForestSRC::vimp(mod2$learner.model)
plot(vi, m.target = "label2")

I am not sure what TRUE, FALSE, and All in the randomForestSRC::vimp plot mean. I read the package documentation and still could not figure it out.

How does that distinction (TRUE, FALSE, All) work?

Miranda

1 Answer


In that example, you have 14 possible labels. If you look at the data:

head(yeast)
  label1 label2 label3 label4 label5 label6 label7 label8 label9 label10
1  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
2  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE  FALSE   FALSE
3  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
4  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
5   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
6  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE

For every label, for example label2, there are two classes, TRUE / FALSE. In that plot, all is the overall error rate, i.e. the proportion of predictions that are wrong across all your samples. TRUE / FALSE are the error rates for the TRUE / FALSE classes separately. So from this plot, you can see the error for TRUE is higher, meaning the model has problems predicting TRUE correctly.

[Plot: error rates for label2 (all, TRUE, FALSE)]

We can check this by looking at the OOB predicted labels:

oob_labels = c(TRUE,FALSE)[max.col(vi$classOutput$label2$predicted.oob)]
table(yeast$label2, oob_labels)

       oob_labels
        FALSE TRUE
  FALSE  1175  204
  TRUE    614  424

You can see that for the TRUE labels (2nd row), you get 614/(614+424) = 0.5915222 wrong. This is roughly what you see in the plot: the error rate for the TRUE label is ~0.6.
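The same confusion table also gives the other two quantities in the first plot. A minimal sketch, reusing oob_labels from above; per the explanation, these values should roughly match the FALSE and all error rates shown in the plot:

tab = table(yeast$label2, oob_labels)
# TRUE class error: TRUE samples predicted as FALSE
tab["TRUE", "FALSE"] / sum(tab["TRUE", ])                  # ~0.59
# FALSE class error: FALSE samples predicted as TRUE
tab["FALSE", "TRUE"] / sum(tab["FALSE", ])                 # ~0.15
# "all": overall proportion of wrong predictions
(tab["TRUE", "FALSE"] + tab["FALSE", "TRUE"]) / sum(tab)   # ~0.34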

As for the 2nd variable importance plot, it is along the same lines: variable importance for the overall ("all") error, or for the TRUE / FALSE class separately. You can plot it like:

# `mat` is assumed here to be the per-class importance matrix from the vimp object
mat = vi$classOutput$label2$importance
par(mfrow = c(1, 3))
for (i in colnames(mat)) { barplot(mat[, i], horiz = TRUE, las = 2) }

[Plot: variable importance barplots for all, TRUE, and FALSE]
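If you prefer the raw numbers to the barplots, you can inspect the same matrix directly. A minimal sketch, assuming mat is the per-class importance matrix defined above, with the overall importance in its first column:

# variables ordered by overall importance, largest first
round(mat[order(mat[, 1], decreasing = TRUE), ], 4)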

StupidWolf
  • Thank you so much! – Miranda May 19 '20 at 20:16
  • Just a final question. In the second plot, we have positive and negative values for the importance of the variables. What does it mean for a variable to have a negative vimp value? Does it indicate that splitting on that variable will lead to greater prediction error? – Miranda May 27 '20 at 14:28
  • It depends on how it was calculated. In randomForestSRC, you can check how the importance is calculated: https://kogalur.github.io/randomForestSRC/theory.html. It writes: "The VIMP for x for a tree is defined as the difference between the perturbed and unperturbed error rate for that tree." – StupidWolf May 27 '20 at 15:20
  • So the average (unperturbed) error rate is higher than the permuted one, but it doesn't mean that including the variable leads to greater prediction error (see the short numeric sketch below). SO is not really meant for this type of question; you can ask it on Cross Validated. – StupidWolf May 27 '20 at 15:23
  • Thank you, will do. I am still struggling to interpret the values on the x-axis of the second plot. I am guessing this depends on the context, but can we say some particular value of vimp is either large or small? Is there any convention on that? – Miranda May 27 '20 at 15:47
  • It's always relative to the data, so large or small... well, I normally say one feature is more important or useful than another. – StupidWolf May 27 '20 at 22:58