i am currently using the rpart
package to fit a regression tree to a data with relatively few observations and several thousand categorical predictors taking two possible values.
from testing the package out on smaller data i know that in this instance it doesn't matter whether i declare the regressors as categorical (i.e. factors) or leave them as they are (they are coded as +/-1).
however, i would still like to understand why passing my explanatory variables as factors significantly slows the algorithm down (not least because i shall soon get new data where response takes 3 diffirent values and treating them as continuous would no longer be an option). surely it should be the other way round?
here is a sample code emulating my data:
library(rpart)
x <- as.data.frame(matrix(sample(c(-1, +1), 50 * 3000, replace = T), nrow = 50))
y <- rnorm(50)
x.fac <- as.data.frame(lapply(x, factor))
now compare:
system.time(rpart( y ~ ., data = x, method = 'anova'))
user system elapsed
1.62 0.21 1.85
system.time(rpart( y ~ ., data = x.fac, method = 'anova'))
user system elapsed
246.87 165.91 412.92
dealing with only one potential split possibility per variable (factors) is simpler and faster than dealing with a whole range of potential splits (for continuous variables), so i am most confused by the rpart
behaviour. any clarifications/suggestions would be very apprecaited.