How to increase the efficiency of a for loop used to run Stepwise Regressions iteratively

Question

All of the code in this question can be found in my GitHub Repository for this research project on Estimated Exhaustive Regression, specifically in the "Both BE & FS script" and "LASSO code" R scripts. To try it out, you can use the heavily truncated folder of datasets "sample_obs(20)" rather than "spencer": the former contains only 20 csvs, while the latter contains 58.5k.

I am running both a Backward Elimination (BE) and a Forward Selection (FS) Stepwise Regression on each of N datasets, each stored as a csv file in a single folder, using the following code (run after the N datasets have been loaded into the list `datasets`):

set.seed(11)      # for reproducibility
full_models <- vector("list", length = length(datasets))
BE_fits <- vector("list", length = length(datasets))
head(BE_fits, n = 3)   # returns a list of 3 elements, all of which are NULL

set.seed(11)      # for reproducibility
for(i in seq_along(datasets)) {
  full_models[[i]] <- lm(formula = Y ~ ., data = datasets[[i]])
  BE_fits[[i]] <- step(object = full_models[[i]],
                       scope = formula(full_models[[i]]),
                       direction = 'backward',
                       trace = 0)
}

And to get the final results I want, I use the following:

BE_Coeffs <- lapply(seq_along(BE_fits), function(i) coef(BE_fits[[i]]))
    
Models_Selected_by_BE <- lapply(seq_along(BE_fits), 
                              \(i) names(coef(BE_fits[[i]])))

And for FS Stepwise, I used:

set.seed(11)      # for reproducibility
null_models <- vector("list", length = length(datasets))
FS_fits <- vector("list", length = length(datasets))
head(FS_fits, n = 3)   # returns a list of 3 elements, all of which are NULL

set.seed(11)      # for reproducibility
for(j in seq_along(datasets)) {
  # the intercept-only model is the starting point for forward selection
  null_models[[j]] <- lm(formula = Y ~ 1, data = datasets[[j]])
  FS_fits[[j]] <- step(object = null_models[[j]],
                       direction = 'forward',
                       scope = formula(full_models[[j]]),
                       trace = 0)
}

Much of the syntax of this code comes from answers to previous questions I asked here several months ago. I am now rerunning all of my models on a new folder of randomly generated synthetic datasets, and I don't want to reuse this code as is: last time, it took well over 12 to 14 hours for the BE and FS stepwise procedures to finish running.
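Since the fits for one dataset do not depend on the fits for any other, one direction that might cut the wall-clock time is spreading the datasets across CPU cores. Below is a minimal sketch of that idea, not a tested solution for the full 58.5k files. It assumes the base `parallel` package, 2 worker processes, and a small synthetic `datasets` list standing in for the real one; the helper name `fit_BE` is hypothetical.

```r
library(parallel)  # ships with base R

# Hypothetical stand-in for the real data: 4 small synthetic data frames
# with a response Y and predictors X1..X4 (column names are assumptions).
set.seed(11)
datasets <- lapply(1:4, function(k) {
  d <- as.data.frame(matrix(rnorm(60 * 4), ncol = 4,
                            dimnames = list(NULL, paste0("X", 1:4))))
  d$Y <- 1.5 * d$X1 + rnorm(60)
  d
})

# Hypothetical helper: fit the full model and run backward elimination for
# one data frame, returning only the coefficients to keep transfers small.
fit_BE <- function(d) {
  full <- lm(Y ~ ., data = d)
  coef(step(full, direction = "backward", trace = 0))
}

cl <- makeCluster(2)                          # assumption: 2 workers
BE_Coeffs <- parLapply(cl, datasets, fit_BE)  # one dataset per task
stopCluster(cl)

Models_Selected_by_BE <- lapply(BE_Coeffs, names)
```

The number of workers would need tuning to the machine, and the real speedup depends on how much time goes into fitting versus shipping data frames to the workers.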

P.S. I was already able to avoid a loop when I did the same thing for LASSO Regression, my 1st Benchmark Variable Selection Procedure, using the following code, which employs a function from R's useful apply family (this only takes 2-3 hours):

library(dplyr)       # select(), starts_with()
library(elasticnet)  # enet()

set.seed(11)     # to ensure replicability
LASSO_fits <- lapply(datasets, function(i) 
               enet(x = as.matrix(select(i, starts_with("X"))), 
                    y = i$Y, lambda = 0, normalize = FALSE))

However, I could not figure out how to do something similar for either basic version of Stepwise, because of the all-important initialization step beforehand (fitting the full model for BE and the null model for FS).
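For illustration, here is a minimal sketch of what an `lapply` version might look like if the initialization and the `step()` call are wrapped in one function per dataset. The helper name `run_stepwise` and the small synthetic `datasets` list are hypothetical stand-ins, not code from my repository. Keeping `lm()` and `step()` inside the same function should let `step()` find the data frame `d` when it refits candidate models.

```r
# Hypothetical stand-in for the real data: 3 small synthetic data frames
# with a response Y and predictors X1..X5 (column names are assumptions).
set.seed(11)
datasets <- lapply(1:3, function(k) {
  d <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5,
                            dimnames = list(NULL, paste0("X", 1:5))))
  d$Y <- 2 * d$X1 - 3 * d$X2 + rnorm(100)
  d
})

# Hypothetical helper: do the initialization (full and null models) and
# both stepwise searches for a single data frame.
run_stepwise <- function(d) {
  full <- lm(Y ~ ., data = d)   # starting point for BE, upper scope for FS
  null <- lm(Y ~ 1, data = d)   # starting point for FS
  list(
    BE = step(full, direction = "backward", trace = 0),
    FS = step(null, scope = formula(full), direction = "forward", trace = 0)
  )
}

fits <- lapply(datasets, run_stepwise)

# Split the results back into the objects used earlier in the question.
BE_fits <- lapply(fits, `[[`, "BE")
FS_fits <- lapply(fits, `[[`, "FS")
Models_Selected_by_BE <- lapply(BE_fits, function(f) names(coef(f)))
Models_Selected_by_FS <- lapply(FS_fits, function(f) names(coef(f)))
```

As far as I understand, this mostly tidies the code: `lapply` by itself is unlikely to be much faster than the `for` loop, because nearly all of the time is spent inside `lm()` and `step()`.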
