I have trained a classification tree with `ctree` from the partykit package, and I need to calculate classification probabilities for each sub tree (not only for the leaf nodes). For example, suppose a sub tree consists of 3 leaf nodes with the following probabilities:

  • leaf 1 (120 observations): 0.45
  • leaf 2 (160 observations): 0.49
  • leaf 3 (190 observations): 0.83

For this hypothetical sub tree the weighted average probability would be (120*0.45 + 160*0.49 + 190*0.83) / (120+160+190) ≈ 0.617
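To make the arithmetic explicit, here is a minimal base R sketch of that weighted average, using the leaf sizes and leaf probabilities listed for the three leaves (0.45, 0.49, 0.83). The variable names `n`, `p` and `subtree_prob` are only illustrative.

```r
# Leaf sizes and leaf probabilities from the hypothetical sub tree
n <- c(120, 160, 190)
p <- c(0.45, 0.49, 0.83)

# Weighted average: sum(n * p) / sum(n)
subtree_prob <- weighted.mean(p, w = n)
round(subtree_prob, 3)  # 0.617
```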

I need to traverse the ctree object and calculate this weighted probability for every node recursively.

I have this code:

data(airquality)
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
                 controls = ctree_control(maxsurrogate = 3))
traverse <- function(treenode){
    if(treenode$terminal){
      bas=paste("Current node is terminal node with",treenode$nodeID,'prediction',treenode$prediction)
      print(bas)
      return(0)
    } else {
      bas=paste("Current node",treenode$nodeID,"Split var. ID:",treenode$psplit$variableName,"split value:",treenode$psplit$splitpoint,'prediction',treenode$prediction)
      print(bas)
    }
    traverse(treenode$left)
    traverse(treenode$right)
  }

which traverses the tree but does not work on a partykit object. On the other hand, I have this code, which lists the probabilities for the leaf nodes only:

preds.ls <- list(predict(airct , type = "prob"))[1]
pred.probs.df <- unique(as.data.frame((preds.ls[[1]])))

Any suggestions on how to combine these 2 snippets into code that traverses a partykit object and calculates this weighted average are appreciated.

NRG
  • It's not quite clear to me what exactly you want to do, as the code as posted contains some errors. However, I think that this answer will help you do what you want (or ask a more precise question): http://stackoverflow.com/questions/41968910/r-extracting-inner-node-information-and-splits-from-ctree-partykit/41976697#41976697 – Achim Zeileis Mar 22 '17 at 11:37

1 Answer

I'm not familiar with partykit but this simple function walks a ctree and extracts the probability for every internal and terminal node:

    library(party)

    set.seed(100)
    dt <- ctree(factor(mpg > 20)~., data = mtcars,
                control = ctree_control(minsplit=2, minbucket=1, mincriterion=0))

    traverse <- function(node) {
      if (node$terminal) {
        return(node$prediction[2])
      }
      return(c(node$prediction[2],
               traverse(node$left), traverse(node$right)))
    }

Calling the function produces the following vector of probabilities:

> traverse(dt@tree)
[1] 0.4375000 1.0000000 0.1428571 0.4285714 0.7500000 0.0000000 0.0000000

The leftmost value is the proportion for the whole sample (the root node), which can be verified as follows:

> mean(mtcars$mpg > 20)
[1] 0.4375

The remaining values follow in the order the function visits the nodes: each node first, then its left subtree, then its right subtree. You can see that the 1s and 0s line up where expected.

Zelazny7
  • Does this implementation take into account the number of observations in each leaf/terminal node? – NRG Mar 21 '17 at 15:15
  • Yes, the internal node probabilities are the probabilities of the entire subtree. – Zelazny7 Mar 21 '17 at 15:20
  • Note that this solution pertains to the old `party` implementation but does not work for the newer `partykit` implementation. The representation of the tree changed significantly. – Achim Zeileis Mar 22 '17 at 11:36
  • @Achim Zeileis - Do you have a reference to a partykit solution for calculating sub tree probabilities? – NRG Mar 23 '17 at 08:28
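
For the `partykit` case raised in the comments, one possible approach is sketched below. It is not taken from the linked answer and is only an illustration under stated assumptions: a binary response built from `mtcars` (mirroring the answer's `mpg > 20` example), no case weights, and no missing values. It uses `nodeids()` to list the terminal nodes below each node and `predict(type = "node")` to find the leaf of every training observation. The proportion of positive responses among the observations that fall into a node's leaves is then the same as the size-weighted average of those leaves' probabilities. The names `mt`, `high_mpg`, `leaf_of_obs` and `subtree_prob` are hypothetical.

```r
library(partykit)

# Assumed example data: binary response derived from mtcars
mt <- transform(mtcars, high_mpg = factor(mpg > 20))
mt$mpg <- NULL
ct <- ctree(high_mpg ~ ., data = mt)

# Terminal node id of every training observation
leaf_of_obs <- predict(ct, type = "node")
y <- mt$high_mpg == "TRUE"

# For every node: collect the leaves of its sub tree and take the share of
# positive responses among the observations that end up in those leaves
ids <- nodeids(ct)
subtree_prob <- sapply(ids, function(id) {
  leaves <- nodeids(ct, from = id, terminal = TRUE)
  mean(y[leaf_of_obs %in% leaves])
})
names(subtree_prob) <- ids
subtree_prob
```

The first element belongs to the root node, so it should match `mean(mtcars$mpg > 20)`. If the tree was fitted with case weights, the plain `mean()` would have to be replaced by a weighted mean.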