I have a data.frame with 36365760 rows and 10 columns, which looks something like this:
dat3 <- data.frame(Region    = rep(c("R1","R2","R3","R1","R2"), 20),
                   Phase     = rep(c("S1","S2"), 50),
                   Treatment = rep(c("P","D"), 50),
                   Region_ID = rep(1:2, 50),
                   Signal    = rnorm(100),
                   Bin       = rep(1, 100))
I then fit a model for each combination of variables of interest:
res <- lapply(unique(dat3$Region), function(i) {
  lapply(unique(dat3$Phase), function(j) {
    lapply(unique(dat3$Treatment), function(k) {
      lapply(unique(dat3$Region_ID[dat3$Region == i & dat3$Phase == j & dat3$Treatment == k]), function(l) {
        # rows for this combination, restricted to the bins of interest
        sel <- dat3$Region == i & dat3$Phase == j & dat3$Treatment == k &
          dat3$Region_ID == l & dat3$Bin %in% c(1:10, 90:100)
        y <- dat3$Signal[sel]
        x <- dat3$Bin[sel]
        lm(y ~ x)
      })
    })
  })
})
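The result is a four-level nested list (indexed by Region, then Phase, Treatment and Region_ID), so a single fit is pulled out like this:

summary(res[[1]][[1]][[1]][[1]])  # fit for the first combination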
I am running this on a computing cluster, but it did not finish overnight. However, when I run it on a subset of the full data.frame, it finishes without error.
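For reference, this is the kind of single-pass grouped fit I was wondering about instead of subsetting the full table inside every loop (just a sketch in base R; the interaction() key and the up-front Bin filter are my own assumptions, and I have not checked that it matches the nested-list output of res above):

sub <- dat3[dat3$Bin %in% c(1:10, 90:100), ]                # keep the bins of interest once
grp <- interaction(sub$Region, sub$Phase, sub$Treatment,
                   sub$Region_ID, drop = TRUE)              # one key per combination
res2 <- lapply(split(sub, grp), function(d) lm(Signal ~ Bin, data = d))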
What would you do differently to speed this up?