All of the code in this question can be found in my GitHub repository for this research project on Estimated Exhaustive Regression, specifically in the "Both BE & FS script" and "LASSO code" R scripts. You may use the heavily truncated dataset folder "sample_obs(20)" rather than "spencer": the former contains only 20 csvs, while the latter contains 58.5k!

I am running both Backward Elimination and Forward Selection stepwise regression on each of N csv-formatted datasets in a file folder, using the following code (run once the N datasets have been loaded):

set.seed(11)      # for reproducibility
full_models <- vector("list", length = length(datasets))
BE_fits <- vector("list", length = length(datasets))
head(BE_fits, n = 3)   # returns a list with 3 elements, all of which are NULL

set.seed(11)      # for reproducibility
for(i in seq_along(datasets)) {
  full_models[[i]] <- lm(formula = Y ~ ., data = datasets[[i]])
  BE_fits[[i]] <- step(object = full_models[[i]], 
                        scope = formula(full_models[[i]]),
                        direction = 'backward',
                        trace = 0) }

And to get the final results I want, I use the following:

BE_Coeffs <- lapply(seq_along(BE_fits), function(i) coef(BE_fits[[i]]))
    
Models_Selected_by_BE <- lapply(seq_along(BE_fits), 
                              \(i) names(coef(BE_fits[[i]])))

And for FS Stepwise, I used:

set.seed(11)      # for reproducibility
FS_fits <- vector("list", length = length(datasets))
null_models <- vector("list", length = length(datasets))   # must exist before the loop assigns into it
head(FS_fits, n = 3)   # returns a list with 3 elements, all of which are NULL

set.seed(11)      # for reproducibility
for(j in seq_along(datasets)) {
  # start from the intercept-only model and add terms up to the full model
  null_models[[j]] <- lm(formula = Y ~ 1, data = datasets[[j]])
  FS_fits[[j]] <- step(object = null_models[[j]],
                       scope = formula(full_models[[j]]),
                       direction = 'forward',
                       trace = 0) }

I got much of this syntax from questions I asked here several months ago. Now I am re-running all of my models on a new file folder filled with new, randomly generated synthetic datasets, and I don't want to reuse this code because last time it took well over 12 or even 14 hours for the BE and FS stepwise procedures to finish running.

P.S. I was already able to avoid a loop when I did the same thing for LASSO regression, my first benchmark variable selection procedure, using the following code, which employs a function from R's apply family (this only takes 2-3 hours):

library(dplyr)        # select(), starts_with()
library(elasticnet)   # enet()

set.seed(11)     # to ensure replicability
LASSO_fits <- lapply(datasets, function(i) 
               enet(x = as.matrix(select(i, starts_with("X"))), 
                    y = i$Y, lambda = 0, normalize = FALSE))

However, I could not figure out how to do something similar for either basic version of stepwise regression because of the all-important initialization step beforehand.
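
For reference, here is a minimal sketch of the loop-free shape being asked about: the initialization step (fitting the full or null model) is moved inside a small helper function, so that lapply can map it over the datasets. The helper names fit_BE and fit_FS and the three toy data frames are hypothetical and are not taken from the repository; only lm, step, coef and lapply from base R are used. This changes the shape of the code, but the time is likely spent inside step() itself, so it should not be expected to be much faster than the for loop on its own.

```r
# Hypothetical toy stand-in for `datasets`: 3 data frames with Y and X1..X5.
set.seed(11)
datasets <- lapply(1:3, function(k) {
  X <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5,
                            dimnames = list(NULL, paste0("X", 1:5))))
  X$Y <- 2 * X$X1 - 3 * X$X2 + rnorm(100)
  X
})

# Hypothetical helper: fit the full model, then eliminate backward.
fit_BE <- function(df) {
  full <- lm(Y ~ ., data = df)
  step(full, direction = "backward", trace = 0)
}

# Hypothetical helper: fit the intercept-only model, then select forward,
# using the full model's formula as the upper scope.
fit_FS <- function(df) {
  null <- lm(Y ~ 1, data = df)
  full <- lm(Y ~ ., data = df)
  step(null, scope = formula(full), direction = "forward", trace = 0)
}

# One call per procedure; no pre-allocated lists are needed.
BE_fits <- lapply(datasets, fit_BE)
FS_fits <- lapply(datasets, fit_FS)

# Names of the terms each procedure kept (includes "(Intercept)").
Models_Selected_by_BE <- lapply(BE_fits, function(fit) names(coef(fit)))
Models_Selected_by_FS <- lapply(FS_fits, function(fit) names(coef(fit)))
```

Because each dataset is handled independently inside the helper, the same helpers could in principle be passed to a parallel mapper such as parallel::mclapply (on non-Windows systems) in place of lapply, which is where a real reduction in wall-clock time would be more likely to come from.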

Marlen