8

i am currently using the rpart package to fit a regression tree to a data with relatively few observations and several thousand categorical predictors taking two possible values.

from testing the package out on smaller data i know that in this instance it doesn't matter whether i declare the regressors as categorical (i.e. factors) or leave them as they are (they are coded as +/-1).

however, i would still like to understand why passing my explanatory variables as factors significantly slows the algorithm down (not least because i shall soon get new data where response takes 3 diffirent values and treating them as continuous would no longer be an option). surely it should be the other way round?

here is a sample code emulating my data:

library(rpart)

x <- as.data.frame(matrix(sample(c(-1, +1), 50 * 3000, replace = T), nrow = 50))
y <- rnorm(50)

x.fac <- as.data.frame(lapply(x, factor))

now compare:

system.time(rpart( y ~ ., data = x, method = 'anova'))

   user  system elapsed 
   1.62    0.21    1.85 

system.time(rpart( y ~ ., data = x.fac, method = 'anova'))

   user  system elapsed 
   246.87  165.91  412.92 

dealing with only one potential split possibility per variable (factors) is simpler and faster than dealing with a whole range of potential splits (for continuous variables), so i am most confused by the rpart behaviour. any clarifications/suggestions would be very apprecaited.

stas g
  • 1,503
  • 2
  • 10
  • 20

2 Answers2

6

You need to profile the code to be sure, but I would be surprised if the timing difference did not come from R having to turn each factor variable into two binary variables as it prepares the model matrix.

Try

Rprof("rpartProfile.Rprof")
rpart( y ~ ., data = x.fac, method = 'anova')
Rprof()

summaryRprof("rpartProfile.Rprof")

and look to see where the time is being spent. Which I have now done:

> summaryRprof("rpartProfile.Rprof")
$by.self
                          self.time self.pct total.time total.pct
"[[<-.data.frame"            786.46    72.45     786.56     72.46
"rpart.matrix"               294.26    27.11    1081.78     99.66
"model.frame.default"          1.04     0.10       3.00      0.28
"terms.formula"                0.96     0.09       0.96      0.09
"as.list.data.frame"           0.46     0.04       0.46      0.04
"makepredictcall.default"      0.46     0.04       0.46      0.04
"rpart"                        0.44     0.04    1085.38     99.99
"[[.data.frame"                0.16     0.01       0.42      0.04
"<Anonymous>"                  0.16     0.01       0.18      0.02
"match"                        0.14     0.01       0.22      0.02
"print"                        0.12     0.01       0.12      0.01
"model.matrix.default"         0.10     0.01       0.44      0.04
....

$by.total
                          total.time total.pct self.time self.pct
"rpart"                      1085.38     99.99      0.44     0.04
"rpart.matrix"               1081.78     99.66    294.26    27.11
"[[<-"                        786.62     72.47      0.06     0.01
"[[<-.data.frame"             786.56     72.46    786.46    72.45
"model.frame.default"           3.00      0.28      1.04     0.10
"eval"                          3.00      0.28      0.04     0.00
"eval.parent"                   3.00      0.28      0.00     0.00
"model.frame"                   3.00      0.28      0.00     0.00
"terms.formula"                 0.96      0.09      0.96     0.09
"terms"                         0.96      0.09      0.00     0.00
"makepredictcall"               0.50      0.05      0.04     0.00
"as.list.data.frame"            0.46      0.04      0.46     0.04
"makepredictcall.default"       0.46      0.04      0.46     0.04
"as.list"                       0.46      0.04      0.00     0.00
"vapply"                        0.46      0.04      0.00     0.00
"model.matrix.default"          0.44      0.04      0.10     0.01
"[["                            0.44      0.04      0.02     0.00
"model.matrix"                  0.44      0.04      0.00     0.00
....

$sample.interval
[1] 0.02

$sampling.time
[1] 1085.5

Note from the above that a big chunk of time is spent in function rpart.matrix:

> rpart:::rpart.matrix
function (frame) 
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    for (i in 1:ncol(frame)) {
        if (is.character(frame[[i]])) 
            frame[[i]] <- as.numeric(factor(frame[[i]]))
        else if (!is.numeric(frame[[i]])) 
            frame[[i]] <- as.numeric(frame[[i]])
    }
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

But it is the for loop in that function where most of the time is spent, essentially converting each column and adding them back to the data frame.

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
  • Looks like it isn't `model.matrix` per se, but `rpart:::rpart.matrix` which has to expand the model shorthand `y~.` in a for loop. Probably the same thing that causes people to recommend against the formula interface in random forests. – joran Jun 19 '13 at 15:41
  • +1 @joran Indeed - I was profiling it but was taking a while on my computer. Have now added these details. – Gavin Simpson Jun 19 '13 at 16:31
  • @GavinSimpson thank you, Gavin. yes, Rprof gives a similar result for me too thus identifying the culprit as rpart.matrix. – stas g Jun 20 '13 at 14:57
  • @joran i am not sure if this is a trivial question, but is it actually possible to forgo the formula syntax if the function argument requires it? – stas g Jun 20 '13 at 15:00
  • @stasg I am actually a little surprised that there is (apparently) no non-formula interface for `rpart` in this case. I would investigate Hong's work around below. – joran Jun 20 '13 at 15:05
  • So this answer identifies the problem, but what is the solution? – littleO Nov 21 '21 at 22:51
6

Just building on @gavin simpson's discovery above... I decided to hack around with rpart.matrix, to see if I could do something about that excessive execution time.

The problem boils down to the use of a for loop. Normally I'm agnostic about for compared to [sl]apply; the latter is generally considered more elegant, but I'm not going to replace a for when it's working fine, just for that. In particular I think the performance benefits of *apply are sometimes overhyped; for has been improved significantly in terms of speed and memory use compared to the old S-Plus days.

Not in this case though. Simply replacing the for with lapply cuts the run time for this example by >2 orders of magnitude. Would be good to see if others can confirm this.

m <- model.frame(x.fac)


# call rpart.matrix
system.time(mm <- rpart:::rpart.matrix(m))
   user  system elapsed 
 208.25   88.03  296.99 


# exactly the same as rpart.matrix, but with for replaced by lapply
f <- function(frame)
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    frame[] <- lapply(frame, function(x) {
        if (is.character(x))
            as.numeric(factor(x))
        else if(!is.numeric(x))
            as.numeric(x)
        else x
    })
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

system.time(mm2 <- f(m))
   user  system elapsed 
   0.65    0.04    0.70 


identical(mm, mm2)
[1] TRUE
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • thank you for this! i have run this on my machine and go similar results `248.64 170.68 421.89` for the original looped function versus `0.81 0.20 1.57 ` for the rewritten lapply one. i think, far from being an exception, this is a testimony to the fact that the apply family is way faster than loops. and on top of that they take advantage of the vectorised nature of the R language. i found that the call to `model.frame` itself takes prohibitively long though. and seeing that `rpart.matrix` is a hidden function, i am not sure where this really leaves me now! – stas g Jun 20 '13 at 15:04
  • @stasg Be careful about concluding that for loops are slow! That wasn't really the case here. The issue was the _code in the for loop_, not the loop itself. It was all the assignments taking place in the loop. `lapply`'s anonymous function sidesteps a lot of the object copying that induces, that's all. I probably could construct a for loop that is written slightly differently, but also much faster. – joran Jun 20 '13 at 15:07
  • I dare you to submit this as a suggested improvement to the package maintainer! ;) – joran Jun 20 '13 at 15:11
  • @stasg the issue here is not the speed of the loop (although `lapply()` will run the looping infrastructure faster [slightly] than `for()`, that will only have an effect if the code inside the loop is trivial, otherwise that component dominates) but as Joran mentions the computational cost of the function calls within the loop. The `lapply` call is eating away all bar one of the calls to `[[<.data.frame()` (there is a single R-side call to `[<-.data.frame()` left over). Data frames are notoriously slow to work with and I think this issue highlights the problem nicely. – Gavin Simpson Jun 20 '13 at 15:45
  • +1 This is an excellent modification. Before you take @joran up on his dare I would be certain that the code is not doing anything odd, that you've run it against all the code examples and tests (if any) in the **rpart** package and aren't relying upon undocumented behaviour of `[<-.data.frame` - my reading of `?[<-.data.frame` would seem to me that it indicates that your usage is correct **and** more importantly documented correct. – Gavin Simpson Jun 20 '13 at 15:50