
I know that randomForest is supposed to be a black box, and that most people are interested in the ROC curve of the classifier as a whole, but I'm working on a problem in which I need to inspect individual trees of the RF. I'm not very experienced with R, so: what's an easy way to plot ROC curves for the individual trees generated by RF?

MaYa

1 Answer


I don't think you can directly generate a ROC curve from a single tree of a random forest built with the randomForest package. You can, however, access the output of each tree for a prediction, for example over the training set.

# caret for an example data set
library(caret)
library(randomForest)

data(GermanCredit)

# use only 50 rows for demonstration
nrows = 50

# extract the first 9 columns and 50 rows as training data (column 10 is "Class", the target)
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]

# build the model
rf_model = randomForest(x = x, y = y, ntree = 11)

# Compute the prediction over the training data. Note predict.all = TRUE
rf_pred = predict(rf_model, newdata = x, predict.all = TRUE, type = "prob")

You can access the predictions of each tree with

 rf_pred$individual

However, the prediction of a single tree is only the most likely label. For a ROC curve you need class probabilities (or some other continuous score), so that sweeping the decision threshold varies the true and false positive rates.
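To see this concretely, here is a minimal, self-contained sketch (same GermanCredit setup as above; which label each tree outputs will vary with the random seed, but the structure won't):

```r
# Illustrative sketch: individual tree predictions are hard labels, not scores
library(caret)         # for the GermanCredit data set
library(randomForest)

data(GermanCredit)
x <- GermanCredit[1:50, 1:9]
y <- GermanCredit$Class[1:50]

rf <- randomForest(x = x, y = y, ntree = 11)
p  <- predict(rf, newdata = x, predict.all = TRUE)

dim(p$individual)                # 50 rows x 11 trees
unique(as.vector(p$individual))  # only the labels "Bad" / "Good", no probabilities
```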

As far as I can tell, at least in the randomForest package there is no way to make the leaves output probabilities instead of labels. If you inspect a tree with getTree(), you will see that the prediction at each terminal node is a single class; use getTree(rf_model, k = 1, labelVar = TRUE) to see the labels in plain text.

What you can do, though, is retrieve the individual predictions via predict.all = TRUE and then aggregate them manually over subsets of the whole forest, e.g. as the fraction of trees voting for a class. That fraction is a continuous score you can feed into a function that computes ROC curves, like those from the ROCR package.
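For example, here is one way to sketch that idea (my own illustration, not the original answer's code; the choice of the first 10 trees as the sub-forest is arbitrary): score each row by the fraction of trees in the sub-forest voting "Good" and hand that score to ROCR:

```r
library(caret)         # GermanCredit data set
library(randomForest)
library(ROCR)

set.seed(54321)
data(GermanCredit)
x <- GermanCredit[1:100, 1:9]
y <- GermanCredit$Class[1:100]

rf <- randomForest(x = x, y = y, ntree = 50)
p  <- predict(rf, newdata = x, predict.all = TRUE)

# Score = fraction of the first 10 trees voting "Good": a value in [0, 1]
# that can be thresholded, unlike a single tree's hard label
score <- rowMeans(p$individual[, 1:10] == "Good")

plot(performance(prediction(score, y), "tpr", "fpr"))
```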

Edit: OK, the link you provided in your comment gave me an idea of how a ROC curve can be obtained. First, we extract one particular tree and run every data point through it, counting at each terminal node the occurrences of the success class as well as the total number of data points landing there; the ratio gives that node's probability for the success class. Then we run each data point through the tree a second time, but now record the node probability instead of the label. This way we can compare the class probabilities with the true labels. Here is the code:

# libraries we need
library(randomForest)
library(ROCR)

# Set fixed seed for reproducibility
set.seed(54321)

# Function to read out the terminal node of a tree for a given data point.
# Note: randomForest sends cases with value <= split point to the LEFT
# daughter node (see ?getTree), so the comparison must go left on <=.
travelTree = function(tree, data_row) {
    node = 1
    while (tree[node, "status"] != -1) {
        split_value = data_row[, tree[node, "split var"]]
        if (split_value <= tree[node, "split point"]) {
            node = tree[node, "left daughter"]
        } else {
            node = tree[node, "right daughter"]
        }
    }
    return(node)
}

# define number of data rows
nrows = 100
ntree = 11

# load example data
data(GermanCredit)

# Easier access of variables
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]

# Build RF model
rf_model = randomForest(x = x, y = y, ntree = ntree, nodesize = 10)

# Extract single tree and add variables we need to compute class probs
single_tree = getTree(rf_model, k = 2, labelVar = TRUE)
single_tree$"split var" = as.character(single_tree$"split var")
single_tree$sum_good = 0
single_tree$sum = 0
single_tree$pred_prob = 0


for (zeile in 1:nrow(x)) {
    out_node = travelTree(single_tree, x[zeile, ])
    single_tree$sum_good[out_node] = single_tree$sum_good[out_node] + (y[zeile] == "Good")
    single_tree$sum[out_node] = single_tree$sum[out_node] + 1
}

# Compute class probabilities from count of "Good" data points in each node.
# Make sure we do not divide by zero
idcs = single_tree$sum != 0
single_tree$pred_prob[idcs] = single_tree$sum_good[idcs] / single_tree$sum[idcs]

# Compute prediction by inserting again data set into tree, but read out
# previously computed probs

single_tree_pred = rep(0, nrow(x))

for (zeile in 1:nrow(x)) {
    out_node = travelTree(single_tree, x[zeile, ])
    single_tree_pred[zeile] = single_tree$pred_prob[out_node]
}

# Et voila: The ROC curve for single tree!
plot(performance(prediction(single_tree_pred, y), "tpr", "fpr"))
Calbers
  • This makes perfect sense! I read trees in JavaScript and calculate leaf-node probabilities by running the entire dataset down the tree and calculating scores like [here](http://stats.stackexchange.com/questions/105760/how-we-can-draw-an-roc-curve-for-decision-trees/110500#110500?newreg=9ca13b7b43bf4985b9e75a5cc1cb2ae6). I'm not sure what counts as a true prediction in the leaves, though, in the multi-class classification case. Do I use the most likely label, as you said? Does everything else in the leaf count as false? And how do I aggregate scores across the leaves of a tree? Thank you very much. – MaYa Mar 08 '17 at 06:39
  • I hadn't thought about using the tree structure given by getTree to compute the labels on the data manually. I don't think there is a function for that in the randomForest package, but it is actually possible to compute the probabilities that way. I have no experience with multi-class classification; if pressed, I would do one-vs-all classification. In the context of ROC curves, multi-class doesn't really make sense anyway. I'm sorry I cannot help you more with that. – Calbers Mar 12 '17 at 20:44
  • In case you are still listening (probably not relevant anymore, but for sake of completeness) I added code in order to compute a ROC curve from a single tree from a random forest. Have fun! – Calbers Mar 14 '17 at 21:57
  • Thank you for adding code! I'm not sure if my understanding of this line is correct though: `single_tree_pred[zeile] = single_tree$pred_prob[out_node]` is it grabbing the probability of true class in the leaf node in which this sample falls? – MaYa Mar 18 '17 at 14:18
  • Yes. `single_tree_pred` is a vector that holds the tree's output probabilities for each row ("zeile" is German for "row"; I forgot to change the variable name) in the training data `x`, so that line does what you think it does. The function `prediction()` from the ROCR package needs a vector of probabilities aligned with the vector of true labels, which is why I did it like this. The answer by rapaio in the link from your first comment actually gave me the inspiration for how to get class probabilities from a single tree. It was a fun exercise in the end; hope it helps! – Calbers Mar 18 '17 at 17:39
  • It does help! Thanks a lot for your time. Glad you had fun with it :) – MaYa Mar 18 '17 at 18:49