
I am evaluating a new statistical learning algorithm, which tries to find the optimal factors and overall model among all candidates, for a paper, and I need to compare it to the two main benchmark methods currently in use, namely LASSO and Stepwise Regression. I am going to run all 3 using the same set.seed value on the same file folder of 47,500 synthetic csv datasets, each of which has 31 columns (30 candidate IVs/factors and 1 DV), i.e. a comparison via Monte Carlo simulation.

Here is my code:

# the 2 lines below together create a simple character list of 
# all the file names in the file folder of datasets you created  
directory_path <- "~/DAEN_698/sample obs"
file_list <- list.files(path = directory_path, full.names = TRUE, 
                        recursive = TRUE)
# this line reads all of the data in each of the csv files 
# using the name of each file stored in the list we just created
csvs <- lapply(file_list, read.csv)


# this call fits all 47,500 LASSO regressions
# and outputs the summary results when called
library(elasticnet)   # provides enet()
set.seed(11)     # to ensure replicability
LASSO_fits <- lapply(csvs, function(i) {
  enet(x = as.matrix(i[-1]), y = i$Y, lambda = 0, normalize = FALSE)
})

# this stores and prints out all of the regression 
# equation specifications selected by LASSO when called
LASSO_Coeffs <- lapply(LASSO_fits, predict)

The problem (as you can see in the first attached screenshot) is this: the Environment pane in RStudio says that the csvs, LASSO_fits, and LASSO_Coeffs objects are all lists of 47,500 elements, which indicated to me that exactly one LASSO had been fit to each dataset, yet when I print one of them out I get 30 rows of coefficient estimates for each dataset i, not 1 as I should. Is my mistake in the predict argument of the lapply function in the last line of code, or somewhere else? In other words, the coefficient estimates for each individual dataset should look like the one on top in the second screenshot when printed, but they actually look like the Console output below, which is completely wrong (warning: the following Console output is extremely long):

> print(LASSO_coeffs1)
$s
[1] 0.1
$fraction
  0 
0.1 
$mode
[1] "fraction"
$coefficients
        X1         X2         X3         X4         X5         X6         X7         X8         X9 
0.20039732 0.13671726 0.12411170 0.06292652 0.07892046 0.00000000 0.00000000 0.00000000 0.00000000 
       X10        X11        X12        X13        X14        X15        X16        X17        X18 
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
       X19        X20        X21        X22        X23        X24        X25        X26        X27 
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
       X28        X29        X30 
0.00000000 0.00000000 0.00000000

> LASSO2_Coeffs[[1]][["coefficients"]]
           X1          X2         X3        X4         X5          X6           X7           X8
0  0.00000000 0.000000000 0.00000000 0.0000000 0.00000000 0.000000000  0.000000000  0.000000000
1  0.06450913 0.000000000 0.00000000 0.0000000 0.00000000 0.000000000  0.000000000  0.000000000
2  0.07198916 0.008349275 0.00000000 0.0000000 0.00000000 0.000000000  0.000000000  0.000000000
3  0.12366660 0.059560378 0.04936567 0.0000000 0.00000000 0.000000000  0.000000000  0.000000000
4  0.14511524 0.081678129 0.06892450 0.0000000 0.02097678 0.000000000  0.000000000  0.000000000
5  0.91860803 0.851771768 0.84108987 0.8804519 0.83171011 0.000000000  0.000000000  0.000000000
6  0.94120023 0.873851547 0.86377944 0.9080777 0.85608657 0.000000000  0.000000000  0.000000000
7  0.95081951 0.883359061 0.87335729 0.9205868 0.86614005 0.000000000  0.000000000  0.000000000
8  0.95697092 0.889083242 0.87974842 0.9283948 0.87281654 0.000000000  0.000000000  0.000000000
9  0.95782050 0.889802627 0.88061073 0.9294186 0.87368880 0.000000000  0.000000000  0.000000000
10 0.96887108 0.899273499 0.89153742 0.9436549 0.88563826 0.000000000  0.000000000  0.000000000

I have only included the first 10 rows for the first 8 factors; the full result stored in the coefficients component of the LASSO2_Coeffs object for each dataset is 500 rows for all 30 candidate factors! That is one row per observation in the dataset, for every single one of the 47k datasets.
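One possible explanation, offered as a sketch and not a confirmed diagnosis: in the elasticnet package, predict() on an enet fit returns coefficients for every step of the solution path when no s argument is supplied, so each list element holds a matrix instead of a single row. Passing s, mode, and type through lapply() returns one coefficient vector per fit. The synthetic data below, the choice of s = 0.1 with mode = "fraction" (copied from the first Console output above), and the names toy_csvs, toy_fits, path, one, and selected are all assumptions made for illustration only.

```r
library(elasticnet)

set.seed(11)
# Hypothetical stand-in for `csvs`: 3 small data frames with Y in column 1
toy_csvs <- lapply(1:3, function(k) {
  X <- matrix(rnorm(100 * 5), 100, 5,
              dimnames = list(NULL, paste0("X", 1:5)))
  data.frame(Y = 2 * X[, 1] - 1.5 * X[, 2] + rnorm(100), X)
})

# Same fitting call as in the question (lambda = 0 gives the LASSO)
toy_fits <- lapply(toy_csvs, function(i) {
  enet(x = as.matrix(i[-1]), y = i$Y, lambda = 0, normalize = FALSE)
})

# Without `s`, predict() returns coefficients at every step of the path,
# so $coefficients is a matrix with one row per step
path <- lapply(toy_fits, predict, type = "coefficients")

# With `s` and `mode`, predict() returns a single coefficient vector;
# s = 0.1 is an assumed value here, not a recommendation
one <- lapply(toy_fits, predict,
              s = 0.1, mode = "fraction", type = "coefficients")

# Names of the factors with non-zero coefficients for each dataset
selected <- lapply(one, function(p) {
  names(p$coefficients)[p$coefficients != 0]
})
```

With the real data, the same extra arguments could be passed in the last line of the original code. Which value of s is appropriate is a modelling choice (cv.enet() in the same package is one way to pick it), and this sketch does not settle that.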

If you would like to verify any of this by running the code yourself, you can find the script I ran it all with (called 'LASSO code (2)') and the file folder 'sample_obs2' in my Github account, within the repository called "Estimated-Exhaustive-Regression-Project".

Marlen
  • What code did you use to get the desired output? Also, please include all of your actual code, not screenshots. – dcsuka Aug 18 '22 at 02:42
  • The code I ran trying to get the desired output first creates a list of the names of each csv file and stores it in the file_list object, then reads the data from each csv file and stores it in the csvs object. After that, the LASSO_fits call was meant to fit a LASSO for each csv, and the LASSO_Coeffs object was meant to hold the factors 'chosen' by the LASSO run on each dataset. So, I was hoping and expecting the LASSO_Coeffs[[i]][["coefficients"]] command to return a single row/set of selected factors for each dataset, but it returns 500 rows of coefficients for every csv instead. – Marlen Aug 19 '22 at 21:38
  • I understand yes, and I am wondering how you got it to only output a single row. What is the code cut off in your screenshot, to produce `LASSO_coeffs1`, the desired output? – dcsuka Aug 20 '22 at 02:27
  • The problem is that when running the code on the entire file folder full of 47k csv datasets, it is not outputting only a single row per dataset like it should; it is actually outputting 500 (the number of rows in each dataset) for each of them. It only prints out the first 30 rows for each, though. However, I have been able to get it to work and output only a single row per dataset when I read and load each dataset one at a time, then run a LASSO on that dataset only. I just can't get that to happen when I try to do it for real, so to speak. – Marlen Aug 20 '22 at 04:24
  • And now, I am having a very similar problem with running/fitting 47k (BE) Stepwise regressions using the step() function from the stats package as well. It is slightly less difficult with Stepwise because the step function doesn't require the Independent Variables/'x' argument to be formatted as a matrix the way the enet() function I used to run my LASSO Regressions does. – Marlen Aug 20 '22 at 04:26
  • You said "I have been able to get it to work and output only a single row per dataset when I read & load each dataset one at a time." Can you edit your question, copying and pasting the code for one of these occasions where it worked, with the associated code? I am referring to `LASSO_Coeffs1` in your image. Please use one of the csvs you sent me, and start with `read.csv(the_csv)`, and produce the expected output. That is essential to solve this. – dcsuka Aug 20 '22 at 04:56
  • Understood, that makes a lot of sense, I'll be right back with it in 15-20 minutes! – Marlen Aug 21 '22 at 01:47
  • Okay, I have the code and the output now, but before I post it in another comment below, I have to clarify something: the output I got in the screenshot when running print(LASSO_Coeffs1) is actually not correct. I will show you the correct output in the following comment, then how I need to reformat it as well, which I have already figured out how to do myself. – Marlen Aug 21 '22 at 02:10
  • Sure, just show me in copy and pasted code exactly what you want to do with one dataframe, and I can apply to all no problem. – dcsuka Aug 21 '22 at 02:25
  • My code is much too large to copy-paste here as a comment, and including the output makes it all the more impossible, so I have decided to ask a new question and post it all there. I will post the hyperlink to my new question here when I have posted it, so stay tuned! – Marlen Aug 21 '22 at 02:43
  • Since this is unresolved, it is probably better here. You can just get rid of the screenshot to make more space. But really, it will just be `read.csv` and then apply a function, so you can cut down to that. – dcsuka Aug 21 '22 at 02:46
  • Here is the link to my new question post, worded much differently: https://stackoverflow.com/questions/73431410/is-there-a-simple-way-to-generalize-code-which-successfully-fits-a-lasso-regress If you would still like me to edit this question/post instead, I can do so, but I thought starting over fresh would be simpler for me than going through everything I typed in this post and rewording it one line at a time! I also uploaded sample_obs2 to a new repository in my Github account: https://github.com/Spencermstarr/Estimated-Exhaustive-Regression-Project – Marlen Aug 21 '22 at 03:25
  • I just edited this post per your request anyway even after asking the question you suggested that I should have asked originally in a separate post! Hope this helps. – Marlen Aug 21 '22 at 03:51
  • No worries, let me know if my answer works! Also, can you mark me as a contributor on the project's github? I'm applying for jobs. – dcsuka Aug 21 '22 at 03:54

0 Answers