Using ForEach to Step Through columns for 1000's of regressions

Question

First, some data: one data frame (mycovs) holding the outcome of interest and the covariates for the regression, and one (betas) holding the explanatory variables.

For each column of betas, I fit lm(outcome ~ covariates + that column) and, for this example, collect the residuals.

set.seed(123) # for repeatability
mycovs = data.frame(outcome = rnorm(100,20,5),
                    race = rep(c("white","black","hispanic","other"),25),
                    income = rep(c("high","low"),50), 
                    age = rnorm(100,30,3))
betas = data.frame(replicate(10000,rnorm(100,50,6)/100))

To do this for every variable in betas I wrote this code:

get_resids <- function(x){
  mydata = cbind(mycovs,x)
  cpg = names(mydata)[ncol(mydata)]
  as.vector(resid(lm(formula(paste("outcome ~ as.factor(race) + as.factor(income) + age + ", cpg )),
                     data = mydata)))
  }
head(get_resids(betas[1]))
[1] -1.8525090 -0.7299173  6.4941289  0.5357159 -0.1771154  7.7554550

Then I can use do.call(cbind, lapply(...)) to build a matrix of these residuals, one column for each of the 10,000 variables in my betas data frame:

system.time(
myresids <- do.call(cbind, lapply(betas, get_resids))
)
   user  system elapsed 
  20.63    0.06   20.76 
> dim(myresids)
[1]   100 10000
> myresids[1:5,1:10]
             X1         X2          X3         X4          X5          X6         X7          X8         X9        X10
[1,] -1.8525090 -3.2651298 -3.54352587 -3.2962217 -2.95237520 -2.52995146 -3.0971490 -3.07625585 -2.8306409 -2.6454698
[2,] -0.7299173 -1.7982698 -2.54966496 -1.8009449 -1.60265484 -0.35825398 -1.6771846 -1.55455681 -1.2834764 -1.0941130
[3,]  6.4941289  6.6330879  5.88252329  7.1254892  6.88332171  7.79059098  6.9549380  6.84726299  6.9756743  6.3790811
[4,]  0.5357159 -0.0629098  0.06064112  0.3261975 -0.05377268 -0.04489599  0.1968423  0.02764062  0.2472463 -0.6944623
[5,] -0.1771154  0.1974865  0.56104333 -0.1188214  0.40202835  1.37694954  0.2904445  0.22634565  1.0650977  0.3231615

Not bad. I am running 10,000 regressions and storing the residuals from all of them in a matrix, and it takes a little over 20 seconds. Note that this is a single-threaded operation that steps through the 10,000 regressions sequentially.

These exposures are actually genetic CpG methylation scores, and I have about a million of them to process, so I wanted to use foreach() and doParallel to run this in parallel, but I have been unable to figure it out.

This is what I tried. I first split the betas data frame into a list of 4 named data frames, each holding a quarter of the columns:

mylist <- list(b1 = betas[1:2500], b2 = betas[2501:5000], b3 = betas[5001:7500], b4 = betas[7501:10000])
names(mylist); length(mylist)
[1] "b1" "b2" "b3" "b4"
[1] 4

Then I tried to run it with foreach and doParallel as follows (the setup of the cluster cl is not shown):

myresids_par <- foreach(i = 1:length(mylist), .combine = "cbind") %dopar% {
    do.call(cbind, lapply(mylist[i], get_resids))
  }
stopCluster(cl)

But I got just 4 sets of residuals, and I'm not sure what it did:

> dim(myresids_par)
[1] 100   4
> head(myresids_par)
             b1         b2         b3          b4
[1,] -1.1051559 -3.2815443 -4.0951682 -2.97181934
[2,] -1.7884883 -1.5842009 -2.2403507 -1.48095064
[3,]  6.0211664  6.8417766  7.0208282  6.93438155
[4,] -0.4692244  0.1247481  0.9653631 -0.08206986
[5,] -0.1857339  0.2945526  1.8936715  0.30034781
[6,]  8.7706564  7.9744631  8.5240021  8.05232223

F. Privé · Accepted Answer · 2018-03-28T07:00:40.307


The problem here is that mylist[i] is accessing a sub-list of length one (not the data frame stored in the i-th element of the list; you'll need mylist[[i]] instead).
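
To make the difference concrete, here is a minimal sketch on a toy list (the names lst, sub, and df are illustrative, not from the question). With single brackets, lapply sees one element, the whole data frame; with double brackets, it sees one element per column. In the question's code this most likely means get_resids received each 2,500-column chunk at once, and since it takes the last column name after cbind, it regressed only on the last column of each chunk, which would explain the 4 columns of output.

```r
# Toy list holding a single two-column data frame (illustrative names)
lst <- list(b1 = data.frame(X1 = 1:3, X2 = 4:6))

sub <- lst[1]    # single brackets: still a list, of length one
df  <- lst[[1]]  # double brackets: the data frame itself

class(sub)  # "list"
class(df)   # "data.frame"

# lapply iterates over the elements of whatever it is given
n_single <- length(lapply(sub, identity))  # 1: the whole data frame, once
n_double <- length(lapply(df, identity))   # 2: once per column
```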

So you can use:

myresids_par <- foreach(i = 1:length(mylist), .combine = "cbind") %dopar% {
  do.call(cbind, lapply(mylist[[i]], get_resids))
}

or better, just use:

myresids_par <- foreach(i = seq_along(mylist), .combine = "c") %dopar% {
  lapply(mylist[[i]], get_resids)
}

And then use do.call(cbind, myresids_par) if you want a matrix or just as.data.frame(myresids_par) if you want a data frame.
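
For completeness, here is a hedged end-to-end sketch of the second variant, including the cluster setup that the question does not show. It uses small stand-in data, and the names covs, preds, fit_resids, chunks, and n_workers are assumptions made for illustration; the helper takes the covariates as an argument instead of reading a global, so everything the workers need appears in the loop body. The chunks are built with split.default rather than by hand.

```r
library(foreach)
library(doParallel)

set.seed(123)
# Small stand-ins for the question's data (sizes chosen only for speed)
covs  <- data.frame(outcome = rnorm(100, 20, 5), age = rnorm(100, 30, 3))
preds <- data.frame(replicate(40, rnorm(100, 50, 6) / 100))

# Hypothetical helper: residuals of outcome ~ age + one predictor column
fit_resids <- function(x, covs) {
  as.vector(resid(lm(outcome ~ age + x, data = cbind(covs, x = x))))
}

# One chunk of columns per worker; split.default splits a data frame by column
n_workers <- 2
chunks <- split.default(preds, cut(seq_along(preds), n_workers, labels = FALSE))

cl <- makeCluster(n_workers)
registerDoParallel(cl)

# Each task returns a list of residual vectors; "c" concatenates the lists
res_list <- foreach(i = seq_along(chunks), .combine = "c") %dopar% {
  lapply(chunks[[i]], fit_resids, covs = covs)
}
stopCluster(cl)

res_par <- do.call(cbind, res_list)  # 100 x 40 matrix

# Sequential reference for comparison
res_seq <- do.call(cbind, lapply(preds, fit_resids, covs = covs))
all.equal(res_par, res_seq)
```

Handing each worker one large chunk keeps the number of tasks small, which should limit scheduling and data-transfer overhead compared with one task per column.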

PS: note that lapply here works because a data frame is also a list. If you had matrices in your list, you would need to use apply(MAT, 2, FUN).
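
A small sketch of that last point, using a toy matrix (mat and col_ranges are illustrative names): lapply on a matrix walks over individual cells, while apply with MARGIN = 2 calls the function once per column.

```r
# Toy 3 x 2 matrix with named columns
mat <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
              dimnames = list(NULL, c("a", "b")))

# apply(MAT, 2, FUN): FUN is called once per column
col_ranges <- apply(mat, 2, function(col) max(col) - min(col))
col_ranges  # a = 2, b = 20

# lapply treats the matrix as a plain vector of 6 cells
n_cells <- length(lapply(mat, identity))  # 6
```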

edited Mar 28 '18 at 07:00

answered Mar 28 '18 at 06:54


Thanks for solving my problem. I will note that `foreach()` was able to step through the columns directly, so I didn't have to create a list object manually in the end. Your solution and `myresids_par <- foreach(i = 1:ncol(betas), .combine = "cbind") %dopar% { do.call(cbind, lapply(betas[i], get_resids)) }` give identical output. Manually splitting into chunks saved 0.25 seconds: 7.6 s vs. 7.85 s for what I pasted above. There must be a lot of overhead in these operations, as I'm getting only a 3:1 improvement on a 20-thread processor. – akaDrHouse Mar 29 '18 at 13:02
