When I use ranger for a classification model and treeInfo() to extract a tree, I see that sometimes a split results in two identical terminal nodes. Is this expected behaviour? Why does it make sense to introduce a split where the final nodes are the same?

From this question, I gather that the prediction could be the majority class (although that question concerns Python and a different random forest implementation). The ranger `?treeInfo` documentation says it should be the predicted class.

MWE

```r
library(ranger)

data <- iris
data$is_versicolor <- factor(data$Species == "versicolor")
data$Species <- NULL

rf <- ranger(is_versicolor ~ ., data = data,
             num.trees = 1, # no need for many trees in this example
             max.depth = 3, # keep depth at an understandable level
             seed = 1351, replace = FALSE)
treeInfo(rf, 1)
#>   nodeID leftChild rightChild splitvarID splitvarName splitval terminal prediction
#> 1      0         1          2          2 Petal.Length     2.60    FALSE       <NA>
#> 2      1        NA         NA         NA         <NA>       NA     TRUE      FALSE
#> 3      2         3          4          3  Petal.Width     1.75    FALSE       <NA>
#> 4      3         5          6          2 Petal.Length     4.95    FALSE       <NA>
#> 5      4         7          8          0 Sepal.Length     5.95    FALSE       <NA>
#> 6      5        NA         NA         NA         <NA>       NA     TRUE       TRUE
#> 7      6        NA         NA         NA         <NA>       NA     TRUE       TRUE
#> 8      7        NA         NA         NA         <NA>       NA     TRUE      FALSE
#> 9      8        NA         NA         NA         <NA>       NA     TRUE      FALSE
```

In this example, the last four rows are terminal nodes that come in sibling pairs with identical predictions: nodes 5 and 6 both predict TRUE, and nodes 7 and 8 both predict FALSE. Graphically, the tree looks like this:

[image: tree diagram]
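
To make the observation checkable without reading the table by eye, here is a minimal sketch that lists every split whose two children are both terminal and carry the same prediction. The data frame `ti` is rebuilt by hand from the output above (only the columns needed), and the names `lookup`, `pairs` and `same_pred` are made up for this illustration.

```r
# treeInfo() output from the question, rebuilt by hand (subset of columns).
ti <- data.frame(
  nodeID     = 0:8,
  leftChild  = c(1, NA, 3, 5, 7, NA, NA, NA, NA),
  rightChild = c(2, NA, 4, 6, 8, NA, NA, NA, NA),
  terminal   = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  prediction = c(NA, "FALSE", NA, NA, NA, "TRUE", "TRUE", "FALSE", "FALSE")
)

# Look up a column value by nodeID (nodeID is 0-based, rows are 1-based).
lookup <- function(ids, column) ti[[column]][match(ids, ti$nodeID)]

# One row per internal node, with information about its two children.
parents <- ti[!ti$terminal, ]
pairs <- data.frame(
  parent     = parents$nodeID,
  left       = parents$leftChild,
  right      = parents$rightChild,
  both_leaf  = lookup(parents$leftChild, "terminal") &
               lookup(parents$rightChild, "terminal"),
  left_pred  = lookup(parents$leftChild, "prediction"),
  right_pred = lookup(parents$rightChild, "prediction")
)

# Keep splits where both children are terminal and predict the same class.
same_pred <- pairs[pairs$both_leaf & pairs$left_pred == pairs$right_pred, ]
print(same_pred)
```

With a fitted forest, `ti <- treeInfo(rf, 1)` should work in place of the hand-built data frame; wrapping the `prediction` column in `as.character()` first avoids any surprises from comparing factors.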

David

1 Answer

I think I found a (partial) answer: the `mtry` and `min.node.size` arguments and how they work.

Because the random forest considers only `mtry` randomly chosen variables at each split, the final split may have to pick from variables that cannot separate the classes well. The chosen split still gives the largest decrease in Gini impurity (or whichever metric was chosen) among those candidates, but the same class can prevail in both resulting terminal nodes.
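
As a small worked example of why such a split can still be worthwhile by the Gini criterion, the sketch below computes the impurity decrease for a split whose children share the same majority class. The class counts are hypothetical and not taken from the iris tree; `gini` is a helper defined here, not a ranger function.

```r
# Gini impurity of a node, computed from its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical class counts (assumed for illustration only).
parent <- c("TRUE" = 40, "FALSE" = 10)
left   <- c("TRUE" = 30, "FALSE" = 2)
right  <- c("TRUE" = 10, "FALSE" = 8)

# Impurity of the children, weighted by node size.
n <- sum(parent)
weighted_children <- (sum(left) * gini(left) + sum(right) * gini(right)) / n
decrease <- gini(parent) - weighted_children

names(which.max(left))   # majority class in the left child
names(which.max(right))  # majority class in the right child
decrease                 # positive: the split still reduces impurity
```

Both children predict TRUE by majority vote, yet the left child is much purer than the parent, so the impurity decrease is positive (about 0.067 for these counts) and the split is a valid candidate.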

Changing `mtry` and `min.node.size` can alter this, but splits whose two terminal nodes predict the same class may still occur.
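
A rough sketch of such an experiment, reusing the data preparation and seed from the question: it refits the single tree for a few `mtry` and `min.node.size` values and counts the splits whose two terminal children predict the same class. The helper `count_same_siblings` and the grid of values are assumptions made for illustration; the resulting counts depend on the ranger version and the seed, so none are shown here.

```r
# Count internal nodes whose two children are both terminal
# and predict the same class (hypothetical helper, not part of ranger).
count_same_siblings <- function(ti) {
  pred <- as.character(ti$prediction)
  l <- match(ti$leftChild, ti$nodeID)
  r <- match(ti$rightChild, ti$nodeID)
  sum(ti$terminal[l] & ti$terminal[r] & pred[l] == pred[r], na.rm = TRUE)
}

if (requireNamespace("ranger", quietly = TRUE)) {
  data <- iris
  data$is_versicolor <- factor(data$Species == "versicolor")
  data$Species <- NULL

  # Grid of settings to try (values chosen arbitrarily for this sketch).
  settings <- expand.grid(mtry = c(1, 2, 4), min.node.size = c(1, 10))
  settings$same_siblings <- NA_integer_

  for (i in seq_len(nrow(settings))) {
    rf <- ranger::ranger(is_versicolor ~ ., data = data,
                         num.trees = 1, max.depth = 3,
                         mtry = settings$mtry[i],
                         min.node.size = settings$min.node.size[i],
                         seed = 1351, replace = FALSE)
    settings$same_siblings[i] <- count_same_siblings(ranger::treeInfo(rf, 1))
  }
  print(settings)
}
```

Setting `mtry` to the number of predictors (4 here) removes the random variable selection, which helps separate the effect of `mtry` from that of the other arguments.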

David
  • Hi David, is there a way to visualize the information in treeInfo the way you did here in your initial post? Sorry if this is not an appropriate question. But I am really interested in that. – Daniel2805 Jun 27 '22 at 12:11
  • @Daniel2805 I can't find the exact code anymore, but I used `data.tree` and `DiagrammeR` to produce the plots! – David Jun 27 '22 at 13:14
  • thx! That is already helpful. I might try to contact you at a later point in time. – Daniel2805 Jun 27 '22 at 14:47