I have saved models which were created using the rpart package in R. I am trying to retrieve some information from these saved models; specifically from rpart.object. While the documentation - rpart doc - is helpful there are a few things it is not clear about:

  1. How do I find out which variables are categorical and which are numeric? Currently, what I do is refer to the 'index' column in the splits matrix. I've noticed that for numeric variables only, the entry is not an integer. Is there a cleaner way to do this?
  2. The csplit matrix refers to the various values a categorical variable can take using integers, i.e. R maps the original names to integers. Is there a way to access this mapping? For example, if my original variable, say Country, can take any of the values France, Germany, Japan etc., the csplit matrix lets me know that a certain split is based on Country == 1, 2. Here, rpart has replaced references to France and Germany with 1 and 2 respectively. How do I get the original names (France, Germany, Japan) back from the model file? Also, how do I know what the mapping between the names and the integers is?
abhgh

1 Answer

Generally it is the terms component that would have that sort of information. See ?rpart::rpart.object.

fit <- rpart::rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$terms  # notice that the attribute dataClasses has the information
attr(fit$terms, "dataClasses")
#------------
 Kyphosis       Age    Number     Start 
 "factor" "numeric" "numeric" "numeric" 
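
As a minimal sketch of how that attribute could be turned into two lists of predictor names: the variable names `classes`, `predictors`, `categorical_vars` and `numeric_vars` are only illustrative, and the response is dropped by using the "response" attribute of the terms object.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Named character vector: variable name -> class recorded at fit time.
classes <- attr(fit$terms, "dataClasses")

# Drop the response so that only the predictors remain.
predictors <- classes[-attr(fit$terms, "response")]

# Illustrative names; factors and ordered factors are treated as categorical.
categorical_vars <- names(predictors)[predictors %in% c("factor", "ordered")]
numeric_vars     <- names(predictors)[predictors == "numeric"]
```

This relies only on the fitted object, so it should also work on a model loaded from a file.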

That example doesn't have a csplit component in its structure because none of the predictors are factors. You could make one fairly easily:

> fit <- rpart::rpart(Kyphosis ~ Age + factor(findInterval(Number,c(0,4,6,Inf))) + Start, data = kyphosis)
> fit$csplit
     [,1] [,2] [,3]
[1,]    1    1    3
[2,]    1    1    3
[3,]    3    1    3
[4,]    1    3    3
[5,]    3    1    3
[6,]    3    3    1
[7,]    3    1    3
[8,]    1    1    3
> attr(fit$terms, "dataClasses")
                                     Kyphosis 
                                     "factor" 
                                          Age 
                                    "numeric" 
factor(findInterval(Number, c(0, 4, 6, Inf))) 
                                     "factor" 
                                        Start 
                                    "numeric" 

The column positions of fit$csplit are just the integer codes of the factor levels, so that "mapping" is the same as it would be from as.numeric() to the levels() of a factor. The cell values are not level codes, though: according to ?rpart.object, 1 means that level goes to the left branch, 3 means it goes to the right, and 2 means the level is not included in the split. If I were trying to construct a character matrix version of the fit$csplit matrix that is labelled with the names of the levels in a factor variable, this would be one path to success:

> kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)), labels=c("low","med","high"))
> str(kyphosis)
'data.frame':   81 obs. of  5 variables:
 $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
 $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
 $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
 $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...
 $ Numlev  : Factor w/ 3 levels "low","med","high": 1 1 2 2 2 1 1 1 1 3 ...
> fit <- rpart::rpart(Kyphosis ~ Age + Numlev + Start, data = kyphosis)
> Dirs <- fit$csplit
> Dirs[] <- c("left", "absent", "right")[Dirs]  # 1 = left, 2 = not in split, 3 = right
> colnames(Dirs) <- levels(kyphosis$Numlev)     # column j corresponds to level j
> Dirs
     low     med     high   
[1,] "left"  "left"  "right"
[2,] "left"  "left"  "right"
[3,] "right" "left"  "right"
[4,] "left"  "right" "right"
[5,] "right" "left"  "right"
[6,] "right" "right" "left" 
[7,] "right" "left"  "right"
[8,] "left"  "left"  "right"

Response to comment: If you only have the model, then use str() to look at it. Near the end of the output for the example I created, right after the "ordered" component, the fit object carries the factor labels in an attribute named "xlevels":

$ ordered            : Named logi [1:3] FALSE FALSE FALSE
  ..- attr(*, "names")= chr [1:3] "Age" "Numlev" "Start"
 - attr(*, "xlevels")=List of 1
  ..$ Numlev: chr [1:3] "low" "med" "high"
 - attr(*, "ylevels")= chr [1:2] "absent" "present"
 - attr(*, "class")= chr "rpart"
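
Working from the model file alone, the "xlevels" attribute can be combined with the splits and csplit components. The following is a sketch under stated assumptions, not a definitive recipe: the temporary file stands in for a model received from someone else, and `path`, `model`, `xlev`, `cat_rows` and `decoded` are illustrative names. It relies on the layout described in ?rpart.object: a splits row with ncat greater than 1 is a factor split, its index entry is a row number of csplit, and csplit codes 1 and 3 mean "goes left" and "goes right".

```r
library(rpart)

# Illustrative fit, then a save/load round trip to stand in for a model file.
kyph <- kyphosis
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      labels = c("low", "med", "high"))
path <- tempfile(fileext = ".rds")   # hypothetical file location
saveRDS(rpart(Kyphosis ~ Age + Numlev + Start, data = kyph), path)

model <- readRDS(path)               # from here on, only the model is used
xlev  <- attr(model, "xlevels")      # named list: factor name -> level labels

# Rows of the splits matrix with ncat > 1 are factor splits; their "index"
# entry is a row number in csplit rather than a numeric cut point.
sp <- model$splits
cat_rows <- which(sp[, "ncat"] > 1)

# For each factor split, report which level names go left and which go right.
decoded <- lapply(cat_rows, function(i) {
  var   <- rownames(sp)[i]
  lev   <- xlev[[var]]
  codes <- model$csplit[sp[i, "index"], seq_along(lev)]
  list(variable = var, left = lev[codes == 1], right = lev[codes == 3])
})
```

Each element of `decoded` names the factor and lists the level labels sent to each branch. The `ncat` column also gives a more direct test for categorical splits than checking whether `index` is an integer.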
IRTFM
  • Thanks! - the term component does give me the variable types explicitly. Is there a way to access the name to integer mapping? – abhgh Apr 05 '15 at 16:39
  • You will need to help understand what you mean by that phrase. – IRTFM Apr 05 '15 at 16:48
  • Adding details to the question. – abhgh Apr 05 '15 at 16:48
  • Thanks again for your response! To get back the level names you are using `kyphosis$Numlev`. But at this stage I only have access to the model file. I don't have the data. – abhgh Apr 05 '15 at 19:12
  • If you call `fit <- rpart(..., model = TRUE)` then the `model.frame` of the training data is stored within the object (by default it is not). Then you can easily access it via `fit$model`. If it is not stored in the object but the data is still available in your R session, then you can use `library("partykit")` and `model.frame(fit)` which re-evaluates the `$call` using the model's `$terms`. – Achim Zeileis Apr 06 '15 at 06:52
  • That nailed it! Thanks! `attr(fit, 'xlevels')` gives me what I need. – abhgh Apr 06 '15 at 09:26
  • Thanks @AchimZeileis. Right now I have no control over how the model is saved, so I would have to make the pessimistic assumption that the data is not part of the model. Also I don't have access to the R session - I usually get these models sent to me offline :). But thanks for pointing me to `partykit` - I'm pretty sure I will be able to use it somewhere else. – abhgh Apr 06 '15 at 11:34
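
For readers who do control how the model is fitted, the `model = TRUE` route mentioned in the comments might look like the sketch below. The data preparation mirrors the answer's example, and `kyph` and `stored_levels` are illustrative names.

```r
library(rpart)

kyph <- kyphosis
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      labels = c("low", "med", "high"))

# model = TRUE asks rpart to keep the model frame inside the fitted object.
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyph, model = TRUE)

# The stored model frame keeps the factor, so its levels can be read back.
stored_levels <- levels(fit$model$Numlev)
```

Keeping the model frame makes the saved object larger, since it stores a copy of the training data.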