I'm working on trying to make my model fitting procedure in R more efficient. Currently, I have all of my data generated with 1500 sims for 15 variables. This data is stored in an array, with each level being one sim, each row being one "person" and each column being one of the 15 variables (eg., 300 x 15 x 1500). I then pass one layer of the array through mplusObject
numerous times, fitting different LPA models (one class, two class, etc). For each of these models, there are numerous outcomes that get reported and saved. I've been working for a while now trying to figure out how to speed this up using parallel processing given that the data is pre-generated and one layer of the array doesn't depend on the other. I'll show what I currently have below, but it isn't working, so I'm wondering if I need a different package. Thanks!
inp <- array(1:(300*15*1500), dim=(300,15,1500)) #Really there's actual data here, not random values, but the data generation process is a whole other thing.
results <- results = matrix(NA,1500,129) #A results table for values to be written to, filled with NAs, 1500 simulations, 129 results.
num_sims=1500
foreach(i=1:num_sims, .packages=c('mclust','MplusAutomation')) %dopar% {
working <- inp[,,i]
sim_num=i
results[sim_num,1] = working[1,17] #number of groups
results[sim_num,2] = working[1,18] #sample size 1
results[sim_num,3] = working[1,19] #sample size 2
results[sim_num,4] = working[1,20] #sample size 3
results[sim_num,5] = working[1,21] #dist2
results[sim_num,6] = working[1,22] #dist3
df <- as.data.frame(working[,1:15])
lpa1_15 <- mplusObject(
TITLE = "1-Class LPA;",
VARIABLE = "USEVARIABLES = x01-x15;
CLASSES=c(1);",
ANALYSIS = "ESTIMATOR = MLR;
TYPE=MIXTURE;",
MODEL = "
%OVERALL%
x01-x15;
[x01-x15];
%c#1%
x01-x15;
[x01-x15];",
usevariables = c("x01", "x02", "x03", "x04", "x05",
"x06", "x07", "x08", "x09", "x10",
"x11", "x12", "x13", "x14", "x15"),
rdata = df)
lpa1_15_fit = mplusModeler(lpa1_15, "df.dat", modelout = "lpa1_15.inp", killOnFail = FALSE, run = 1L)
if (!is.null(lpa1_15_fit$results$summaries$LL)){
results[sim_num,7] = -2 * lpa1_15_fit$results$summaries$LL
results[sim_num,8] = lpa1_15_fit$results$summaries$BIC
results[sim_num,9] = lpa1_15_fit$results$summaries$aBIC
results[sim_num,10] = lpa1_15_fit$results$summaries$AIC
results[sim_num,11] = lpa1_15_fit$results$summaries$AICC}
lpa2_15 <- mplusObject(
TITLE = "2-Class LPA;",
VARIABLE = "USEVARIABLES = x01-x15;
CLASSES=c(2);",
ANALYSIS = "ESTIMATOR = MLR;
TYPE=MIXTURE;",
MODEL = "
%OVERALL%
x01-x15;
[x01-x15];
%c#1%
x01-x15;
[x01-x15];
%c#2%
x01-x15;
[x01-x15];",
OUTPUT = "TECH11;",
usevariables = c("x01", "x02", "x03", "x04", "x05",
"x06", "x07", "x08", "x09", "x10",
"x11", "x12", "x13", "x14", "x15"),
rdata = df)
lpa2_15_fit = mplusModeler(lpa2_15, "df.dat", modelout = "lpa2_15.inp", killOnFail = FALSE, run = 1L)
if (!is.null(lpa2_15_fit$results$summaries$LL)){
results[sim_num,12] = -2 * lpa2_15_fit$results$summaries$LL
results[sim_num,13] = lpa2_15_fit$results$summaries$BIC
results[sim_num,14] = lpa2_15_fit$results$summaries$aBIC
results[sim_num,15] = lpa2_15_fit$results$summaries$AIC
results[sim_num,16] = lpa2_15_fit$results$summaries$AICC
results[sim_num,17] = lpa2_15_fit$results$summaries$Entropy
if (!is.null(lpa2_15_fit$results$summaries$T11_VLMR_2xLLDiff)){
results[sim_num,18] = lpa2_15_fit$results$summaries$T11_VLMR_2xLLDiff
results[sim_num,19] = lpa2_15_fit$results$summaries$T11_VLMR_PValue
results[sim_num,20] = lpa2_15_fit$results$summaries$T11_LMR_Value
results[sim_num,21] = lpa2_15_fit$results$summaries$T11_LMR_PValue}
... and so on...
}
The results I got from running this were:
[[1]]
[1] 0.491
[[2]]
[1] 0.7037
I've tried using parallel, foreach and dopar, and parLapply, but just can't get them to work. The closest I got was using the foreach function, but that returned a single value for each and none of the results were saved to the results table. I can provide the code for how I attempted these, but none of them worked really, so at this point I'm questioning if it can be done (and if so, which method/approach is best for this setup).
I should also point out that the levels of data can be run in any order (eg., [,,1], [,,5], [,,3]) is okay, but once that level is called the full function (or however it should be set up) should be run, as several tests compare the current model to the previous model (3 classes vs 2 classes) for that dataset, so in that sense the data does have to be run in order.
Thanks for any help or suggestions you might have!