I'm training a random forest to multilevel data, essentially treating it like a nonparametric regression model. I'm accounting for the unobserved group-level heterogeneity as a correction outside of the random forest training process. But during the random forest training process, I want each tree to be grown from a random bootstrap sample (or subsample) of cross-sectional units rather than observations. So, say that my data is multiple observations of many individuals, I want to bootstrap individuals, not observations of individuals.
The dummy example below shows that strata
doesn't solve my problem.
> N <- 1000
> p <- 100
> A <- matrix(rnorm(p^2),p)
> library(MASS)
> X <- mvrnorm(N, rep(0,p), A %*%t(A))
> B <- rnorm(p)
> fac <- sample(1:1000 %% 10 +1)
> y <- log(fac + exp(X%*%B)^{1/fac}) + rnorm(N, sd = 10)
> fac <- as.factor(fac)
> library(randomForest)
> forest <- randomForest(y = y, x = cbind(X, fac), ntree = 1, keep.inbag = TRUE, replace = FALSE
+ , strata = fac #Stratify by the factor
+ )
> sum(forest$inbag[fac == '1'])
[1] 62
> sum(forest$inbag[fac == '2'])
[1] 60
> sum(forest$inbag[fac == '3'])
[1] 60
> sum(forest$inbag[fac == '4'])
[1] 64
> sum(forest$inbag[fac == '5'])
[1] 64
> sum(forest$inbag[fac == '6'])
[1] 65
> sum(forest$inbag[fac == '7'])
[1] 54
> sum(forest$inbag[fac == '8'])
[1] 72
> sum(forest$inbag[fac == '9'])
[1] 62
> sum(forest$inbag[fac == '10'])
[1] 69
Alternatively, I could grow single random-subspace trees to a sample that I select myself, and then manually combine the trees. This is below.
"%ni%" <- Negate("%in%")
library(foreach)
rf_cluster_bootstrap <- foreach(j = 1:10) %do% {
set.seed(j)
sampfac <- sample(unique(fac), replace = TRUE)
unsampfac <-unique(fac[fac %ni% sampfac])
Xt <- foreach(i = sampfac, .combine = rbind) %do% {Xmat[Xmat$fac == i,]}
Xt$fac <- NULL
fj <- randomForest(y = y, x = Xt, ntree = 1, sampsize = nrow(Xt), replace = FALSE, keep.inbag = TRUE)
Xnt <- foreach(i = unsampfac, .combine = rbind) %do% {Xmat[Xmat$fac == i,]}
Xnt$fac <- NULL
pred <- predict(fj, newdata = Xnt)
oob_outvec <- rep(NA, N)
oob_outvec[as.numeric(names(pred))] <- pred
return(list(fj = fj, oob = oob_outvec))
}
While this seems to work as far as it goes, I'd need to write my own predict functions, keep track of row names, etc. etc. etc. There will probably be coding errors and other unexpected things. For example, here is a function to combine the output:
combineFunc <- function(x){
rflist <- lapply(x, `[[`, 'fj')
# rf <- randomForest::combine(rflist)#I don't know why this doesn't work
rf <- foreach(i = 1:10, .combine = randomForest::combine) %do% {rflist[[i]]}
return(rf)
}
xx <- combineFunc(rf_cluster_bootstrap)
head(xx$inbag)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
101 0 1 0 3 1 0 0 0 0 0
102 3 0 1 0 2 2 1 3 2 1
103 0 1 0 2 1 1 0 1 0 1
104 0 2 1 0 3 0 1 0 2 0
105 1 0 0 1 1 0 0 1 1 0
106 0 0 3 1 1 0 1 1 2 1
Basic things like the inbag
matrix are garbled. I can fix it, but I'm unlikely to catch everything.
Before I go and do this from scratch, I'd like to know if there is there something already implemented that does what I am trying to do? Or a simpler/more elegant way to do it?
(This thread is similar, but it uses rpart
, which can't handle random subspaces)