
I have a data.frame with 36,365,760 rows and 10 columns, which looks something like this:

# toy example; the real data.frame has 36,365,760 rows and 10 columns
dat3 <- data.frame(Region    = rep(c("R1", "R2", "R3", "R1", "R2"), 20),
                   Phase     = rep(c("S1", "S2"), 50),
                   Treatment = rep(c("P", "D"), 50),
                   Region_ID = rep(1:2, 50),
                   Signal    = rnorm(100),
                   Bin       = 1:100)  # bins run from 1 to 100

I then fit a model for each combination of variables of interest:

res <- lapply(unique(dat3$Region), function(i) {
  lapply(unique(dat3$Phase), function(j) {
    lapply(unique(dat3$Treatment), function(k) {
      lapply(unique(dat3$Region_ID[dat3$Region == i & dat3$Phase == j & dat3$Treatment == k]), function(l) {
        # subset to this combination, keeping only the first and last bins
        d <- dat3[dat3$Region == i & dat3$Phase == j & dat3$Treatment == k &
                    dat3$Region_ID == l & dat3$Bin %in% c(1:10, 90:100), ]
        lm(Signal ~ Bin, data = d)
      })
    })
  })
})

I am running this on a computing cluster, but it did not finish overnight; when I run it on a subset of the full data.frame, however, it runs without error.

What would you do better?

  • Why do you fit a model for each value of the variables of interest, instead of modelling the same with dummy variables? What do you ultimately want to achieve? – mhovd Oct 04 '22 at 21:16
  • Each combination of variables constitutes an individual scatter plot (x=bin, y=signal). I need the fit to create extrapolated data via predict for each of the combinations. – gdeniz Oct 04 '22 at 21:26
  • If you want a less painful-to-code approach, I'd recommend using `dplyr` to do the grouping; see the first sketch below. The `group_map` help page has an example of fitting a model to each group (and groups can be defined as unique values of many columns). But mhovd also has a great point: if all you are interested in is predictions, there is no difference between fitting individual models to each group and fitting a full model `y ~ x * Region * Phase * Treatment`, and fitting one big model will be more efficient than many small ones. – Gregor Thomas Oct 05 '22 at 02:17
  • [These answers](https://stackoverflow.com/q/25416413/903061) suggest more efficient `lm` implementations; see the second sketch below. – Gregor Thomas Oct 05 '22 at 02:24
  • Thanks for everyone's feedback. I tried the faster and more convenient `lm` versions, and they have been running on a node for two days and counting. I am interested in the single-model solution (sketched below). Could someone guide me through the essential steps? How would I perform a `predict` call with, say, Phase=S1, Treatment=D, Region=R1 for bins 11:89 (which were intentionally not used for the fit)? What would the model call be? Literally `y ~ x * Region * Phase * Treatment`, after subsetting for bins c(1:10,90:100)? – gdeniz Oct 07 '22 at 20:10
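
A minimal sketch of the `dplyr` grouping suggested in the comments, assuming `dat3` as defined in the question: `group_by()` plus `group_map()` replaces the four nested `lapply()` calls and subsets the data only once.

library(dplyr)

fit_bins <- c(1:10, 90:100)  # bins used for the fit, as in the question

res <- dat3 %>%
  filter(Bin %in% fit_bins) %>%                      # subset once, up front
  group_by(Region, Phase, Treatment, Region_ID) %>%
  group_map(~ lm(Signal ~ Bin, data = .x))           # one lm() per group

Here `res` is a plain list of `lm` objects, one per combination of the four grouping columns that actually occurs in the data.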
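
For the more efficient `lm` implementations linked above, a bare-bones sketch using base R's `.lm.fit()`, which skips the formula and model-frame overhead of `lm()`. Here `d` is a hypothetical placeholder for one group's subset, and predictions are computed by hand because `.lm.fit()` returns a plain list with no `predict()` method.

X    <- cbind(1, d$Bin)           # design matrix for one group's subset d: intercept + slope
fit  <- .lm.fit(X, d$Signal)
beta <- fit$coefficients          # c(intercept, slope)

pred <- cbind(1, 11:89) %*% beta  # by-hand prediction for the held-out bins 11:89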
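
And a sketch of the single-model call asked about in the last comment; this is one reading of the suggestion, not a tested recipe. The model is fit on bins 1:10 and 90:100 only, and `Region_ID` is converted to a factor and added to the interaction on the assumption that each ID needs its own intercept and slope (the full interaction makes the single fit equivalent to the many per-group fits). `predict()` is then called on the held-out bins 11:89.

fit_dat <- subset(dat3, Bin %in% c(1:10, 90:100))
fit_dat$Region_ID <- factor(fit_dat$Region_ID)   # IDs are labels, not numbers

big_fit <- lm(Signal ~ Bin * Region * Phase * Treatment * Region_ID,
              data = fit_dat)

# extrapolate, e.g., Region R1, Phase S1, Treatment D, Region_ID 1, bins 11:89
new_dat <- data.frame(Region = "R1", Phase = "S1", Treatment = "D",
                      Region_ID = "1", Bin = 11:89)
pred <- predict(big_fit, newdata = new_dat)

Note that in the toy `dat3` above, Phase, Treatment, and Region_ID are perfectly confounded, so this particular fit would be rank-deficient; on the real data, where the combinations vary independently, it should not be.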

0 Answers