I'm fitting a random forest to multilevel data, essentially treating it as a nonparametric regression model. I account for the unobserved group-level heterogeneity with a correction outside of the random forest training process. But during training, I want each tree to be grown from a random bootstrap sample (or subsample) of cross-sectional units rather than of observations. So if my data consist of multiple observations on many individuals, I want to bootstrap individuals, not observations of individuals.
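To make the resampling scheme concrete, here is a minimal sketch of the draw I have in mind (the function name and the ids argument are just placeholders for whatever identifies the cross-sectional units):

# Cluster bootstrap: resample whole individuals with replacement,
# then keep every observation of each sampled individual.
cluster_bootstrap_rows <- function(ids) {
  sampled <- sample(unique(ids), replace = TRUE)
  unlist(lapply(sampled, function(i) which(ids == i)))
}
# rows <- cluster_bootstrap_rows(fac)   # row indices feeding one tree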

The dummy example below shows that the strata argument doesn't solve my problem.

> N <- 1000
> p <- 100
> A <- matrix(rnorm(p^2),p)
> library(MASS)
> X <- mvrnorm(N, rep(0,p), A %*%t(A))
> B <- rnorm(p)
> fac <- sample(1:1000 %% 10 +1)
> y <- log(fac + exp(X%*%B)^{1/fac}) + rnorm(N, sd = 10)
> fac <- as.factor(fac)
> library(randomForest)
> forest <- randomForest(y = y, x = cbind(X, fac), ntree = 1, keep.inbag = TRUE, replace = FALSE
+  , strata = fac                        #Stratify by the factor
+ )
> sum(forest$inbag[fac == '1'])
[1] 62
> sum(forest$inbag[fac == '2'])
[1] 60
> sum(forest$inbag[fac == '3'])
[1] 60
> sum(forest$inbag[fac == '4'])
[1] 64
> sum(forest$inbag[fac == '5'])
[1] 64
> sum(forest$inbag[fac == '6'])
[1] 65
> sum(forest$inbag[fac == '7'])
[1] 54
> sum(forest$inbag[fac == '8'])
[1] 72
> sum(forest$inbag[fac == '9'])
[1] 62
> sum(forest$inbag[fac == '10'])
[1] 69
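What I actually want is all-or-nothing sampling at the cluster level: for any given tree, every observation of a cluster is either in the bag or out of it. A quick check of that property (which the stratified fit above clearly fails, since each level contributes only part of its observations to the tree):

# TRUE only if a cluster is entirely in-bag or entirely out-of-bag for this tree
tapply(as.vector(forest$inbag), fac, function(z) all(z > 0) || all(z == 0))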

Alternatively, I could grow single random-subspace trees on samples that I select myself, and then combine the trees manually. This is below:

"%ni%" <- Negate("%in%")
library(foreach)
rf_cluster_bootstrap <- foreach(j = 1:10) %do% {
  set.seed(j)
  sampfac <- sample(unique(fac), replace = TRUE)
  unsampfac <-unique(fac[fac %ni% sampfac])
  Xt <- foreach(i = sampfac, .combine = rbind) %do% {Xmat[Xmat$fac == i,]}
  Xt$fac <- NULL
  fj <- randomForest(y = y, x = Xt, ntree = 1, sampsize = nrow(Xt), replace = FALSE, keep.inbag = TRUE)
  Xnt <- foreach(i = unsampfac, .combine = rbind) %do% {Xmat[Xmat$fac == i,]}
  Xnt$fac <- NULL
  pred <- predict(fj, newdata = Xnt)
  oob_outvec <- rep(NA, N)
  oob_outvec[as.numeric(names(pred))] <- pred
  return(list(fj = fj, oob = oob_outvec))
}
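Pooling the per-tree out-of-bag predictions from that list is then just an average over the trees for which each observation was out of bag (a sketch using the oob element returned above; observations that were never out of bag come back as NaN):

# N x ntree matrix of out-of-bag predictions (NA where an observation was in-bag)
oob_mat  <- do.call(cbind, lapply(rf_cluster_bootstrap, `[[`, "oob"))
# Average across trees, ignoring the trees that used the observation
oob_pred <- rowMeans(oob_mat, na.rm = TRUE)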

While this seems to work as far as it goes, I'd need to write my own predict functions, keep track of row names, and so on. There will probably be coding errors and other surprises. For example, here is a function to combine the output:

combineFunc <- function(x){
  rflist <- lapply(x, `[[`, 'fj')
  # rf <- randomForest::combine(rflist)   # fails when handed a list; see note below
  rf <- foreach(i = seq_along(rflist), .combine = randomForest::combine) %do% {rflist[[i]]}
  return(rf)
}
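The commented-out call most likely fails because combine() expects the forests as separate arguments rather than a single list; do.call() is one way to unpack the list, equivalent to the foreach loop above:

# Equivalent to combine(rflist[[1]], rflist[[2]], ..., rflist[[10]])
rf_all <- do.call(randomForest::combine,
                  lapply(rf_cluster_bootstrap, `[[`, "fj"))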

xx <- combineFunc(rf_cluster_bootstrap)
head(xx$inbag)
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
101    0    1    0    3    1    0    0    0    0     0
102    3    0    1    0    2    2    1    3    2     1
103    0    1    0    2    1    1    0    1    0     1
104    0    2    1    0    3    0    1    0    2     0
105    1    0    0    1    1    0    0    1    1     0
106    0    0    3    1    1    0    1    1    2     1

Basic things like the inbag matrix are garbled. I can fix it, but I'm unlikely to catch everything.

Before I go and do this from scratch, I'd like to know: is there something already implemented that does what I am trying to do? Or a simpler/more elegant way to do it?

(This thread is similar, but it uses rpart, which can't handle random subspaces)

1 Answer

I would have left this as a comment, but I don't have enough reputation yet. It's not a definitive answer, but I wanted to leave something for other folks who come across this post on a similar search, as I did.

I haven't tried it personally, but here is a relatively recent R implementation of blocked bootstrapping for random forests that I came across (blockForest, https://rdrr.io/cran/blockForest/); it seems to get at what you wanted to do.

Citation: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2942-y
