
I want to run a mixed model (using lme4::lmer) on 60M observations of the following format; all predictor variables are categorical (factors), while the dependent variable tc is continuous; patient is the grouping variable for a random intercept term. I have 64-bit R and 16Gb RAM and I'm working from a central server. RStudio is the most recent server version.

model <- lmer(tc~sex+age+lho+atc+(1|patient),
              data=master,REML=TRUE)

lho sex tc      age         atc patient
18  M   16.61   45-54       H   628143
7   F   10.52   12-15       G   2013855
30  M   92.73   35-44       N   2657693
19  M   24.92   70-74       G   2420965
12  F   17.44   65-69       A   2833610
31  F   7.03    75 and over A   1090322
3   F   28.59   70-74       A   2718649
29  F   4.09    75 and over C   384578
16  F   67.22   65-69       R   1579355
23  F   7.7     70-74       C   896374

I'm getting a `cannot allocate a vector of 25.5Gb` error. I'm assigned 40Gb on the server and am using 25, so I guess that means I need another 10Gb or so. I don't think I can get any extra space assigned.

I don't know the first thing about parallel processing except that I'm using one of four cores at the moment. Can anyone suggest parallel code for this model, or perhaps a different fix?

  • try Doug Bates's `MixedModels` package/library for Julia? – Ben Bolker Jul 16 '15 at 11:39
  • First of all, the "allocate vector" means RAM allocation, so the server disk memory is irrelevant. Next, you really need to take the time to learn about packages such as `parallel` and `snow` and `bigdata`. Otherwise, even if someone writes a tool for you, you won't understand how to modify it or to find speed bottlenecks. – Carl Witthoft Jul 16 '15 at 11:45
  • Thanks Ben I'll have a look. – steve Jul 16 '15 at 11:53
  • Time is against me Carl, that's why I'm posting the question. – steve Jul 16 '15 at 11:55
  • Have a look at your design matrix. I'd guess that it's huge. You need to consider carefully if the model you are trying to fit is appropriate. E.g., do you really need each `atc` level? Could you model `atc` as a random effect? Or possibly agglomerate `atc` levels? Also, have you checked that your numeric variables are `numeric` in R? Anyway, parallelization will not help with the memory problem. In contrast, since each CPU will need RAM for its task, parallelization will exacerbate the problem. – Roland Jul 16 '15 at 12:19
  • Thank you Roland. I can't lose anything on atc. Hadn't thought of that issue of RAM. – steve Jul 16 '15 at 12:45

1 Answer


As pointed out by Carl Witthoft, the standard parallelization tools in R use a shared memory model, so they will make things worse rather than better (their main purpose is to accelerate compute-bound jobs by using multiple processors).

In the short term, you might be able to save some memory by treating the categorical fixed-effect predictors (age, atc) as random effects but forcing their variances to be large. I don't know if this will be enough to save you or not; it will compress the fixed-effect model matrix a lot, but the model frame will still be stored/replicated with the model object ...

dd1 <- read.table(header=TRUE, stringsAsFactors=TRUE,  ## factors needed for levels() below
text="lho sex tc      age         atc patient
18  M   16.61   45-54       H   628143
7   F   10.52   12-15       G   2013855
30  M   92.73   35-44       N   2657693
19  M   24.92   70-74       G   2420965
12  F   17.44   65-69       A   2833610
31  F   7.03    75_and_over A   1090322
3   F   28.59   70-74       A   2718649
29  F   4.09    75_and_over C   384578
16  F   67.22   65-69       R   1579355
23  F   7.7     70-74       C   896374")
## simulate a larger data set (n = 1e5) with the same structure
n <- 1e5
set.seed(101)
dd2 <- with(dd1,
      data.frame(tc=rnorm(n,mean=mean(tc),sd=sd(tc)),
                 lho=round(runif(n,min=min(lho),max=max(lho))),
                 sex=sample(levels(sex),size=n,replace=TRUE),
                 age=sample(levels(age),size=n,replace=TRUE),
                 atc=sample(levels(atc),size=n,replace=TRUE),
                 patient=sample(1:1000,size=n,replace=TRUE)))
library("lme4")
m1 <- lmer(tc~sex+(1|lho)+(1|age)+(1|atc)+(1|patient),
           data=dd2,REML=TRUE)

Random effects are automatically sorted from the largest to the smallest number of levels, so patient comes first in the parameter (theta) vector; that is why wrapfun below optimizes only the first element and pins the remaining three. Following the machinery described in the ?modular help page:

lmod <- lFormula(tc~sex+(1|lho)+(1|age)+(1|atc)+(1|patient),
                  data=dd2,REML=TRUE)
names(lmod$reTrms$cnms)  ## check the ordering: patient (most levels) comes first
## REML criterion as a function of the theta (relative SD) parameters
devfun <- do.call(mkLmerDevfun, lmod)
## optimize only the patient term; pin the other three relative SDs at a huge value
wrapfun <- function(tt,bigsd=1000) {
    devfun(c(tt,rep(bigsd,3)))
}
wrapfun(1)
opt <- optim(fn=wrapfun,par=1,method="Brent",lower=0,upper=1000)
opt$fval <- opt$value  ## mkMerMod expects 'fval' rather than optim's 'value'
res <- mkMerMod(environment(devfun), opt, lmod$reTrms, fr=lmod$fr)
res

You can ignore the reported variances for the categorical terms, and use ranef() to recover their (unshrunk) estimates.
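
For example, continuing from the toy fit res above (a minimal sketch of the extraction step):

re <- ranef(res)
str(re, max.level=1)  ## a list with one data frame of level estimates per grouping factor
head(re$age)          ## estimated offset per age band; essentially unshrunk because
head(re$atc)          ## the age/atc variances were pinned at a huge value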

In the long term, the proper way to do this problem is probably to parallelize it with a distributed-memory model. In other words, you would parcel the data out in chunks to different servers, use the machinery described in ?modular to set up a likelihood function (actually a REML-criterion function) that gives the REML criterion for a subset of the data as a function of the parameters, and then run a central optimizer that takes a set of parameters, submits them to each server, retrieves the criterion values, and adds them up. The only two problems I see with implementing this are (1) I don't actually know how to implement distributed-memory computation in R (based on this intro document it seems that the Rmpi/doMPI packages might be the right way to go); (2) in the default way that lmer is implemented, the fixed-effects parameters are profiled out rather than being explicitly part of the parameter vector.
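
For what it's worth, here is a very rough sketch of that structure, using the parallel package on a single machine as a stand-in for a true distributed-memory (Rmpi/doMPI) setup; it reuses the simulated dd2 data and the all-random-effects formulation from above, and it ignores problem (2), so treat it as a skeleton rather than a working solution:

library(parallel)

## split the data by patient so the chunks are independent
chunks <- split(dd2, dd2$patient %% 2)

cl <- makeCluster(length(chunks))
clusterEvalQ(cl, library(lme4))

## each worker builds a REML-criterion function ('devfun') for its own chunk
## and keeps it in its global environment
invisible(clusterApply(cl, chunks, function(d) {
    lmod <- lFormula(tc~sex+(1|lho)+(1|age)+(1|atc)+(1|patient),
                     data=d, REML=TRUE)
    assign("devfun", do.call(mkLmerDevfun, lmod), envir=.GlobalEnv)
    NULL
}))

## evaluated on each worker: that chunk's criterion at a given theta
eval_chunk <- function(theta) devfun(theta)

## central objective: broadcast theta, collect the per-chunk criteria, add them
combined_crit <- function(theta) {
    sum(unlist(clusterCall(cl, eval_chunk, theta)))
}

opt <- optim(par=rep(1,4), fn=combined_crit,
             method="L-BFGS-B", lower=rep(0,4))
opt$par  ## relative standard deviations, in largest-to-smallest-levels order
stopCluster(cl)

In a genuinely distributed setting the makeCluster()/clusterApply() steps would be replaced by MPI workers each holding their own chunk of the full data, but the optimizer-plus-summed-criterion structure would be the same.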

Ben Bolker
  • Thanks so much for your input Ben. It will take me some time to parse all of that but I'll give it a try. – steve Jul 17 '15 at 09:31
  • Just got some extra space released on the server but now I'm getting this error message: `Error in qr.default(X, tol = tol, LAPACK = FALSE) : too large a matrix for LINPACK`. Can anyone shed any light on this? – steve Jul 17 '15 at 14:50
  • you can try `traceback()` to see where the problem is. My guess is that you're running into an issue with the stage checking that the fixed effect model matrix (X) is of full rank. You can skip this step (`control=lmerControl(check.rankX="ignore")`), but I'm guessing that you will run into more trouble very shortly thereafter. – Ben Bolker Jul 17 '15 at 16:29
  • Thanks Ben, here's the output after running the second line you gave:
    `8: stop("too large a matrix for LINPACK")`
    `7: qr.default(X, tol = tol, LAPACK = FALSE)`
    `6: qr(X, tol = tol, LAPACK = FALSE)`
    `5: chkRank.drop.cols(X, kind = rankX.chk, tol = 1e-07)`
    `4: lme4::lFormula(formula = tc ~ sex + age + lho + atc + (1 | patient), data = master, REML = TRUE, control = list(optimizer = "bobyqa", restart_edge = TRUE, boundary.tol = 1e-05, calc.derivs = TRUE, use.last.params = FALSE, checkControl = list(check.nobs.vs.rankZ = "ignore", check.nobs.vs.nlev = "stop", check.nlev.gtreq.5 = "ignore", check.nlev.gtr.1 = "stop", check.nobs.vs.nRE = "stop", check.rankX = "message+drop.cols", check.scaleX = "warning", check.formula.LHS = "stop"), checkConv = list(check.conv.grad = list(action = "warning", tol = 0.002, relTol = NULL), check.conv.singular = list(action = "ignore", tol = 1e-04), check.conv.hess = list(action = "warning", tol = 1e-06)), optCtrl = list()))`
    `3: eval(expr, envir, enclos)`
    `2: eval(mc, parent.frame(1L))`
    `1: lmer(tc ~ sex + age + lho + atc + (1 | patient), data = master, REML = TRUE)` – steve Jul 20 '15 at 11:21
  • so turn off the `check.conv.grad` warning too: `check.conv.grad="ignore"` in `lmerControl()`. (And similarly with any other pre-conditioning checks that fail due to lack of memory ...) – Ben Bolker Jul 20 '15 at 12:36
  • Thanks for your help Ben. Still stuck but I'm going to try a few more things tomorrow. – steve Jul 20 '15 at 17:00