
I created a binary classification tree using ctree. I would like each terminal node to contain the row names associated with that node. How can I accomplish this?

For example, for the dataset below, I would like the leftmost node to list the names of everyone aged 23 or younger (Abner to Abudemio) and the rightmost node to list the rest (Abundiantus to Acelin).

      names       age height young   
1     Abner       18   76.1   yes
2     Abraham     19   77.0   yes
3     Abram       20   78.1   yes
4     Abrasha     21   78.2   yes
5     Absalom     22   78.8   yes
6     Abudemio    23   79.7   yes
7     Abundiantus 24   79.9    no
8     Acacio      25   81.1    no
9     Acario      26   81.2    no
10    Accursius   27   81.8    no
11    Ace         28   82.8    no
12    Acelin      29   83.5    no
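
For reproducibility, here is a minimal sketch for entering these data in R. It assumes the data frame is called `a` and that `young` is stored as a factor; both choices are assumptions for illustration, not something fixed by the table itself.

```r
# Rebuild the example data; the object name `a` and the factor coding
# of `young` are assumptions made for this sketch.
a <- data.frame(
  names  = c("Abner", "Abraham", "Abram", "Abrasha", "Absalom", "Abudemio",
             "Abundiantus", "Acacio", "Acario", "Accursius", "Ace", "Acelin"),
  age    = 18:29,
  height = c(76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
             79.9, 81.1, 81.2, 81.8, 82.8, 83.5),
  young  = factor(rep(c("yes", "no"), each = 6), levels = c("no", "yes")),
  stringsAsFactors = FALSE
)

# Quick check of column types and the first few rows
str(a)
```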


chl
Noam Kremen

1 Answer


Here is one hacky solution. It requires only a small modification to the source code of the plotting functions in the party package. Reading the source, I noticed that the terminal_panel calls node_barplot when the outcome is a factor. (Everything is in the R/plot.R file, if you have the source package installed.) We can modify the latter to display custom labels in the default bar chart.

Just issue the following command at the R prompt:

fixInNamespace("node_barplot", pos="package:party")

and then add what we need:

  1. Add labels = NULL, gp = NULL to the existing list of arguments for that function.
  2. Near the end of the function body, after grid.rect(gp = gpar(fill = "transparent")), add the following lines:

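    if (!is.null(labels)) {
      # labels of the cases that ended up in the node being drawn
      labs <- as.character(labels[ctreeobj@where==node$nodeID])
      len <- length(labs)
      # centre every label horizontally; spread them evenly over the panel height
      x <- unit(rep(0.5, len), "npc")
      y <- unit((seq_len(len) - 0.5)/len, "npc")
      # seq_len() avoids the 1:0 pitfall if a node has no labels
      for (i in seq_len(len))
        grid.text(labs[i], x=x[i], y=y[i], just="center", gp=gp)
    }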
    

    Here, the key idea is to select the labels that belong to the node being drawn (node$nodeID). We can get this information from the where slot of the ctree object, which is a vector indicating the node each case ended up in. The if test ensures that the function still behaves as originally written when no labels are supplied. The gp argument can be used to change the font size or color.
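
The selection and placement logic can be tried outside the plotting function. The following is a minimal base-R sketch, not part of party: the vectors `labels` and `where` are hand-written, hypothetical stand-ins for the `labels` argument and the `where` slot, and `node_id` plays the role of `node$nodeID`.

```r
# Hypothetical stand-ins for the `labels` argument and the `where` slot
labels <- c("Abner", "Abraham", "Abram", "Abundiantus", "Acacio")
where  <- c(2, 2, 2, 3, 3)   # terminal node ID assigned to each case

# Labels for a single node, mirroring labels[ctreeobj@where == node$nodeID]
node_id <- 2
labs <- as.character(labels[where == node_id])
len  <- length(labs)

# Vertical positions in npc units: centres of `len` equal-height bands
y <- (seq_len(len) - 0.5) / len

# The same grouping for all nodes at once, without plotting
by_node <- split(labels, where)
print(by_node)
```

With a fitted tree, the where slot of the ctree object would take the place of the hand-written `where` vector.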

A typical call to the function would now be:

plot(cfit, tp_args=list(labels=dfrm$names))

where dfrm$names is a column of labels from a data frame named dfrm. Here is an illustration with your data:

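library(party)  # provides ctree() and attaches grid, which supplies gpar()
# a is the data frame shown in the question
cfit <- ctree(young ~ age, data=a, 
              controls=ctree_control(minsplit=2, minbucket=2))
plot(cfit, tp_args=list(labels=a$names, gp=gpar(fontsize=8, col="darkgrey")))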


(I have also tested this with the on-line example with the iris dataset.)

chl