
For a node x in a partykit::ctree object, I use the following lines to get the splitting variable at the node:

k=info_node(x)
names(k$p.value)

However, the splitting variable returned by this code differs from the one shown on the tree drawn by plot. It turns out that three columns in k$criterion share the minimum p-value, i.e.

inds=which(k$criterion['p.value',]==k$p.value)
length(inds) #3

It seems that info_node(x) returns the first of the three variables as names(k$p.value), while plot chooses the third one. I wonder whether this discrepancy has one of two causes:

  1. Multiple variables have the minimum p-value, and there is an internal method to break such a tie in selecting only one splitting variable.

  2. Maybe these three variables have slightly different p-values, but because of the fixed p-value precision in k$criterion, they appear to have the same p-value.

Any insight is appreciated!

blueskyddd

2 Answers


The comparisons are done internally on the log-p-value scale, which is more reliable when the p-values are tiny. If ties (within machine precision) still remain among the p-values, they are broken based on the size of the corresponding test statistic.
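
To illustrate the idea only, here is a minimal base-R sketch. It is not partykit's actual code: the variable names, the numbers, the tolerance `tol`, and the rule that the larger statistic wins are all assumptions made for the example.

```r
# Tiny p-values underflow to 0 on the raw scale, so they cannot be told apart...
exp(-800) == exp(-900)   # both underflow to 0
# ...while on the log scale the two values stay distinct.
-800 == -900

# Hypothetical log-p-values and test statistics for three candidate variables
# (made-up numbers for illustration, not output from ctree)
log_p <- c(b = log(1e-20), c = log(1e-20), d = log(1e-20))
stat  <- c(b = 10.2, c = 9.7, d = 14.5)

# Candidates whose log-p-value ties with the minimum within a tolerance
# (tol is an assumed value)
tol  <- sqrt(.Machine$double.eps)
tied <- names(log_p)[abs(log_p - min(log_p)) < tol]

# Break the tie with the test statistic
# (assumption for this sketch: the largest statistic wins)
selected <- tied[which.max(stat[tied])]
selected
```

In this toy setup all three candidates tie on the p-value, so the choice is made by the statistic alone.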

Achim Zeileis

Here is one example. Thank you!

library(partykit)
a=rep('N',87)
a[77]='Y'
b=rep(FALSE,87)
b[c(7,10,11,33,56,77)]=TRUE
d=rep(1,87)
d[c(29,38,40,42,65,77)]=0
dfb=data.frame(a=as.factor(a),b=as.factor(b),d=as.factor(d))
tFit=ctree(a ~ ., data=dfb, control = ctree_control(minsplit= 10,minbucket = 5,
                                                    maxsurrogate=2, alpha = 0.05))
plot(tFit) #displayed splitting variable is d
tNodes=node_party(tFit)
nodeInfo=info_node(tNodes)
names(nodeInfo$p.value) #b, not d
blueskyddd