
I need to run the enet() function from the elasticnet package in RStudio on each of 47,000 datasets individually. The datasets were constructed so that the true underlying population model for each one is known; the goal is to see how often the new algorithm recovers that model compared with LASSO and Stepwise, and to benchmark the runtime of each.

I have absolutely no idea how to do this, or even what search terms to use to look it up; I have already tried both Google and Bing several times. I believe the only packages my code currently requires are:

  • leaps
  • lars
  • stats
  • plyr
  • dplyr
  • readr
  • elasticnet

This is my code to run the LASSO (obviously, I made up the data frame names for the x & y arguments in the enet() function for this post/question lol):

## Attempt 2: Run a LASSO regression using
## the enet function from the elasticnet library
set.seed(11)
library(elasticnet)
enet_LASSO <- enet(x = as.matrix(df_all_obs_on_all_of_the_IVs),
                   y = df_all_obs_on_the_DV,
                   lambda = 0, normalize = FALSE)
print(enet_LASSO)
# To ascertain which predictors/regressors are still included
# in the model after running a LASSO regression on it for
# variable selection, use predict() with type = "coefficients".
# (The generic lives in the stats package, but this dispatches
# to elasticnet's predict.enet method; no new x matrix is
# needed when only coefficients are requested.)
LASSO_coeffs <- predict(enet_LASSO, s = 0.1,
                        mode = "fraction", type = "coefficients")
print(LASSO_coeffs)
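Once predict() returns the coefficient vector, the predictors the LASSO kept are simply the ones with nonzero coefficients. A minimal base-R sketch of that extraction step (the coefficient values and variable names here are made up for illustration; in practice the vector comes from the `$coefficients` element of the predict.enet result):

```r
# Hypothetical coefficient vector, standing in for the
# $coefficients element returned by predict.enet(type = "coefficients")
LASSO_coeffs <- c(X1 = 1.8, X2 = 0, X3 = -0.4, X4 = 0, X5 = 2.1)

# The variables the LASSO retained are those with nonzero coefficients
selected_vars <- names(LASSO_coeffs)[LASSO_coeffs != 0]
print(selected_vars)  # "X1" "X3" "X5"
```

Comparing `selected_vars` against the known population model for each dataset is then a simple set comparison.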

Optional background context & motivation: I am in the middle of a research project and in order to compare a new statistical learning procedure for choosing the optimal regression specification, I am running this new algorithm as a Monte Carlo Experiment in which I run it and two benchmarks (LASSO & Stepwise) on a synthetic dataset my collaborator created for me which consists of a multiple GB file folder filled with 47,000 individual csv files.

Marlen
  • Please ask one question at a time and explain specifically what you need help with on that basis. For example, it is unclear here if you (a) don't know how to fit a model to many datasets, (b) don't know how to fit the models you specified, (c) don't know how to store the results and compare them, (d) don't know how to benchmark the runs, etc. – socialscientist Aug 12 '22 at 04:10
  • https://meta.stackexchange.com/questions/222735/can-i-ask-only-one-question-per-post – socialscientist Aug 12 '22 at 11:10
  • My post asks one single question, how to run a regression function in R on each of a large number of different datasets all within the same folder. It asks nothing else. – Marlen Aug 12 '22 at 11:51

1 Answer


Listing all the files, applying a read function to each, and then applying enet to each result should do the trick, provided you have enough RAM. Here is what the code could look like:

library(readr)  # read_csv
library(dplyr)  # mutate, %>%

file_list <- list.files(directory_path, full.names = TRUE, recursive = TRUE)
csvs <- lapply(file_list, read_csv)
names(csvs) <- file_list
tib <- tibble::enframe(csvs) %>%
  mutate(enet_column = lapply(value, function(y) {your_function_contents_relative_to_y_here})) %>%
  tidyr::unnest(cols = c(value, enet_column))  # this step is optional

You may wish to simply lapply your custom function over the csvs list instead and form the tibble at the end: one lapply to build a list of enet objects, another to store the LASSO coefficients. Let me know if you have any questions.
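The two-lapply pattern can be sketched as below. To keep the example self-contained it uses lm() as a stand-in for the model fit and two tiny synthetic data frames in place of the CSVs; in your case, swap in the enet() call from the question and build `csvs` from `lapply(file_list, read_csv)`:

```r
# Two small synthetic data frames standing in for the 47,000 CSVs;
# in practice this named list comes from reading the files
csvs <- list(
  "a.csv" = data.frame(Y = c(1, 2, 3, 4), X1 = c(1, 2, 3, 4), X2 = c(4, 3, 2, 1)),
  "b.csv" = data.frame(Y = c(2, 4, 6, 8), X1 = c(1, 2, 3, 4), X2 = c(0, 1, 0, 1))
)

# Pass 1: fit one model per dataset (lm here is a placeholder for enet)
fits <- lapply(csvs, function(df) lm(Y ~ ., data = df))

# Pass 2: extract the fitted coefficients from each model
coef_list <- lapply(fits, coef)

length(coef_list)  # 2, one coefficient vector per CSV
names(coef_list)   # "a.csv" "b.csv"
```

Because the list carries the file names along, the results stay matched to their source datasets, which matters when comparing each fit against its known population model.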

dcsuka
  • I have found your answer extremely helpful as a guide & have been able to replicate it in a way that would work for most regression functions I am guessing. But the enet() function requires the x argument to be formatted as a matrix which has derailed me. So, this is what I did just to see if it worked on a smaller folder with less datasets. I ended up taking your advice about using an lapply function to the csvs object as well, so I actually have 2 of them. My code after file list which is identical to yours is (next comment, word count): – Marlen Aug 12 '22 at 22:05
  • `csvs3 <- lapply(file_list2, function(i) {read.csv(i)})` `LASSO_3_fits <- lapply(csvs3, function(i) { enet(x = as.matrix(csvs3[[i]][-1]), y = csvs3[[i]]$Y, lambda = 0, normalize = FALSE) })` But I keep getting this error, so frustrating! `Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': invalid subscript type 'list'` – Marlen Aug 12 '22 at 22:07
  • And these were the first two lines in case anyone was curious: `directory_path2 <- "C:/Users/spenc/Documents/DAEN 698 - 2022/sample_obs2"` `file_list2 <- list.files(path = directory_path2, full.names = TRUE, recursive = TRUE)` – Marlen Aug 12 '22 at 22:08
  • I would need some sample data for a definite conclusion, but it looks like in the `LASSO_3_fits <- lapply(csvs3, function(i) { enet(x = as.matrix(csvs3[[i]][-1]), y = csvs3[[i]]$Y, lambda = 0, normalize = FALSE) })` the function(i) clause should just be written as a function of i assuming that i is one data frame in the list. `csvs3[[i]][-1]` should just be i[-1], or something like that. `csvs3[[i]]$Y` should just be i$Y as well, I believe. Fine tune to your data frame structure though. – dcsuka Aug 12 '22 at 23:24
  • Ahh, I think I see what you are getting at, but I am still new here. Is there some way I can post a small file folder of 10 or 20 datasets for you in a comment or send it to you? – Marlen Aug 13 '22 at 03:21
  • Actually, I realized the real problem is sort of the inverse of what I thought it was several hours ago. It wasn't that I was failing to convert the list/dataframe in the x argument into a matrix when I was running the as.matrix function as I thought, it was that I was doing that, so my syntax to subset a dataframe was no longer correct for a matrix. Hope that clarifies it. – Marlen Aug 13 '22 at 03:25
  • Sure, just email it and all of your code to davecsuka@gmail.com – dcsuka Aug 13 '22 at 03:53
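The fix dcsuka describes in the comments can be sketched as follows: lapply hands the function each *element* of the list (a data frame), not an index, so the body should operate on that data frame directly. The sketch below uses two tiny synthetic data frames and stops at building the x matrix and y vector, since elasticnet may not be installed; with it loaded, the commented-out lapply is the corrected version of the enet call from the comments:

```r
# Tiny synthetic data frames standing in for the CSVs read from disk
csvs3 <- list(
  data.frame(Y = c(1, 2, 3), X1 = c(2, 4, 6), X2 = c(1, 0, 1)),
  data.frame(Y = c(3, 2, 1), X1 = c(1, 1, 2), X2 = c(5, 4, 3))
)

# lapply passes each data frame to the function, so subset df itself,
# not csvs3[[i]] (indexing with a data frame caused the original error)
xy_pairs <- lapply(csvs3, function(df) {
  list(x = as.matrix(df[-1]),  # all predictor columns as a matrix
       y = df$Y)               # the response column
})

# With library(elasticnet) loaded, the corrected fits would be:
# LASSO_3_fits <- lapply(csvs3, function(df) {
#   enet(x = as.matrix(df[-1]), y = df$Y, lambda = 0, normalize = FALSE)
# })

dim(xy_pairs[[1]]$x)  # 3 2: 3 rows, 2 predictor columns
```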