I am plotting a tree with rpart.plot::prp(), much like:

library("rpart.plot")
data("ptitanic")
data <- ptitanic
data$sibsp <- as.integer(data$sibsp) # just to show that these are integers
data$age <- as.integer(data$age) # just to show that these are integers
tree <- rpart(survived~., data=data, cp=.02)
prp(tree, fallen.leaves = FALSE, type=4, extra=1, varlen=0, faclen=0, yesno.yshift=-1)

Even though certain variables are integers (age and sibsp), rpart creates a seemingly arbitrary split point, which confuses the viewer. Nobody has 2.5 siblings/spouses aboard; the logical split is sibsp >= 3.

I have looked at split.fun in this excellent tutorial and ?prp. Other than using a regex to capture the number, format it properly, and replace it in the label string, I can't think of any solutions within prp.

A workaround I am considering is to pass a modified tree (object of class rpart) where the contents have been rounded. Is it possible to do this by modifying tree$splits?

Any other ideas?

C8H10N4O2

2 Answers

1) ordered factors. I think age is OK as a continuous variable, but to handle sibsp and parch, make them into ordered factors:

data <- transform(data, sibsp = ordered(sibsp), parch = ordered(parch))
tree <- rpart(survived~., data=data, cp=.02)
prp(tree, fallen.leaves = FALSE, type=4, extra=1, varlen=0, faclen=0, yesno.yshift=-1)

2) split.fun. Another approach is to specify our own split.fun like this:

# next 4 lines are same as in question
data <- ptitanic
data$sibsp <- as.integer(data$sibsp) # just to show that these are integers
data$age <- as.integer(data$age) # just to show that these are integers
tree <- rpart(survived~., data=data, cp=.02)

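# Round the numeric right-hand side of each split label up to the next integer.
# Labels without ">=" or "<" (e.g. factor splits) are returned unchanged.
split.labs <- function(x, labs, digits, varlen, faclen) {
   sapply(labs, function(lab) 
      if (grepl(">=|<", lab)) {
         rhs <- sub(".* ", "", lab)  # text after the last space, e.g. "2.5"
         sub(rhs, ceiling(as.numeric(rhs)), lab, fixed = TRUE)  # literal match, not a regex
      } else lab)
} 
prp(tree, fallen.leaves = FALSE, type=4, extra=1, varlen=0, faclen=0, yesno.yshift=-1, 
   split.fun = split.labs) # same as in question except for split.fun= arg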

This labels the numeric splits with integer split points, e.g. sibsp >= 3 instead of sibsp >= 2.5.
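The label rewriting in (2) can be tried without fitting a tree. This is a minimal base R sketch, assuming labels of the form `variable >= number`; the sample strings in `labs` and the helper name `round_split_label` are made up for illustration and are not part of rpart.plot.

```r
# Hypothetical label strings in the style that prp is assumed to pass to split.fun
labs <- c("sibsp >= 2.5", "age < 9.5", "sex = male")

# Illustrative helper: round the numeric right-hand side up to the next integer
round_split_label <- function(lab) {
  if (!grepl(">=|<", lab)) return(lab)   # leave factor splits untouched
  rhs <- sub(".* ", "", lab)             # text after the last space
  sub(rhs, ceiling(as.numeric(rhs)), lab, fixed = TRUE)
}

out <- vapply(labs, round_split_label, character(1), USE.NAMES = FALSE)
print(out)
```

Using `fixed = TRUE` keeps the `.` in a value such as `2.5` from being read as a regex wildcard.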

(2a) A variation of (2) which gives slightly more control, i.e. one can specify precisely which variables to modify, is the following:

# next 4 lines are same as in question
data <- ptitanic
data$sibsp <- as.integer(data$sibsp) # just to show that these are integers
data$age <- as.integer(data$age) # just to show that these are integers
tree <- rpart(survived~., data=data, cp=.02)

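# Same as split.labs, but only touches labels that mention the listed variables.
split.labs2 <- function(x, labs, digits, varlen, faclen) {
    sapply(labs, function(lab) 
        if (grepl("age|sibsp|parch", lab)) {
            rhs <- sub(".* ", "", lab)  # text after the last space, e.g. "2.5"
            sub(rhs, ceiling(as.numeric(rhs)), lab, fixed = TRUE)  # literal match, not a regex
        } else lab)
} 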

# similar to (2) except we use split.labs2 and clip.right.labs = FALSE,
# so that the right-hand labels keep the variable name that split.labs2 matches on

prp(tree, type = 4, fallen.leaves = FALSE, extra=1, varlen=0, faclen=0, 
   yesno.yshift=-1, clip.right.labs = FALSE, split.fun = split.labs2)

G. Grothendieck
  • Thanks for the answer. Using an ordered factor does force the partitions to discretize, but I'm not sure it improves the readability of the labels. In both cases a mental adjustment is required ("OK, this means three or more, this means two or fewer"), whereas just `>= 3` and `< 3` seems much simpler. Also, age is (almost) always a whole number in the underlying dataset (see `data("ptitanic")`), and I worry about a false sense of precision being conveyed by `age >= 9.5` when in fact there are people with ages 9.0 and 10.0 in the dataset, but none between. – C8H10N4O2 Feb 22 '16 at 14:57
  • Thanks, this is the regex approach I imagined would work. One point to consider: if the split were `x <= 9.5` and `x > 9.5`, then applying `ceiling()` to the split point would move an observation where `x == 10` from the right branch to the left. – C8H10N4O2 Feb 22 '16 at 17:48
  • Note that there are only `eq`, `lt` and `ge` arguments to `prp`. There is no `le` argument and no `gt` argument so it seems that the situation discussed in your comment cannot occur. – G. Grothendieck Feb 23 '16 at 15:58
  • OK, so `ceiling()` should be fine then. Thanks for your help. – C8H10N4O2 Feb 23 '16 at 21:27
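
The point settled in these comments can be checked directly in base R. This is a minimal sketch, assuming the predictor takes only integer values; the vector `x` and the split point `s` are made up for illustration.

```r
x <- 0:12   # hypothetical integer-valued predictor
s <- 9.5    # hypothetical split point halfway between two integers

# For ">=" and "<", rounding the split point up leaves every observation
# on the same side of the split.
stopifnot(identical(x >= s, x >= ceiling(s)))
stopifnot(identical(x <  s, x <  ceiling(s)))

# For "<=" the same rounding would move x == 10 across the split.
moved <- x[(x <= s) != (x <= ceiling(s))]
print(moved)
```

Since the labels use only `<` and `>=`, `ceiling()` is the rounding that preserves the partition; `floor()` or `round()` would not be safe in general.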

Version 3.0.0 of the rpart.plot package (July 2018) treats predictors with integer values specially to automatically get the results you want.

So rpart.plot now automatically prints sibsp >= 3 instead of sibsp >= 2.5, since it sees that in the training data all values of sibsp are integral.

Section 4.1 of the vignette for the rpart.plot package has an example.
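
A minimal sketch of the difference, assuming rpart.plot 3.0.0 or later is installed. The `roundint` argument is documented in `?prp`; the guard variable `has_new_rpart_plot` and the exact plot arguments are only illustrative.

```r
# Run the comparison only if a recent enough rpart.plot is available
has_new_rpart_plot <- requireNamespace("rpart.plot", quietly = TRUE) &&
  utils::packageVersion("rpart.plot") >= "3.0.0"

if (has_new_rpart_plot) {
  library(rpart.plot)   # also attaches rpart
  data(ptitanic)
  tree <- rpart(survived ~ ., data = ptitanic, cp = 0.02)

  # Default behaviour: splits on all-integer predictors are shown as integers
  prp(tree, type = 4, extra = 1, roundint = TRUE)

  # Previous behaviour, for comparison: midpoints such as 2.5 are shown
  prp(tree, type = 4, extra = 1, roundint = FALSE)
}
```

Going by the documentation, rounding applies only when every training value of a predictor is an integer, so a predictor with any fractional value would presumably keep its fractional split point.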

Stephen Milborrow
  • From `?prp`: New in version 3.0.0. If `roundint=TRUE` (default) and all values of a predictor in the training data are integers, then splits for that predictor are rounded to integer. For example, display `nsiblings < 3` instead of `nsiblings < 2.5`. Exactly the change I was looking for. Thanks. – C8H10N4O2 Oct 12 '18 at 14:08