For a paper, I am evaluating a new statistical learning algorithm that tries to find the optimal factors and overall model among all candidates, and I need to compare it to the two main benchmark methods currently used, namely LASSO and Stepwise Regression. I am going to run all 3 using the same set.seed value on the same file folder of 47,500 synthetic csv datasets, each of which has 31 columns (30 candidate IVs/factors and 1 DV), i.e. a comparison via Monte Carlo simulation.
Here is my code:
# load the package that provides enet()
library(elasticnet)
# the 2 lines below together create a simple character vector of
# all the file names in the file folder of datasets you created
directory_path <- "~/DAEN_698/sample obs"
file_list <- list.files(path = directory_path, full.names = TRUE,
                        recursive = TRUE)
# this line reads all of the data in each of the csv files
# using each file name stored in the list we just created
csvs <- lapply(file_list, read.csv)
# this function fits all 47,500 LASSO regressions
# and outputs the summary results when called
set.seed(11) # to ensure replicability
LASSO_fits <- lapply(csvs, function(i) {
  enet(x = as.matrix(i[-1]), y = i$Y, lambda = 0, normalize = FALSE)
})
# this stores and prints out all of the regression
# equation specifications selected by LASSO when called
LASSO_Coeffs <- lapply(LASSO_fits, predict)
The problem (as you can see in the 1st attached screenshot) is that the Environment pane in RStudio shows the csvs, LASSO_fits, and LASSO_Coeffs objects as lists of 47,500 elements each, which suggested to me that exactly one LASSO had been fit to each dataset. Yet when I print one of them out, I get 30 rows of coefficient estimates for each dataset i, not 1 as I should. Is my mistake in the predict argument of the lapply call in the last line of code, or somewhere else? In other words, the coefficient estimates for each individual dataset should look like the one on top in the second screenshot, but they actually look like the Console output below, which is completely wrong (warning: the following Console output is extremely long):
> print(LASSO_coeffs1)
$s
[1] 0.1
$fraction
0
0.1
$mode
[1] "fraction"
$coefficients
X1 X2 X3 X4 X5 X6 X7 X8 X9
0.20039732 0.13671726 0.12411170 0.06292652 0.07892046 0.00000000 0.00000000 0.00000000 0.00000000
X10 X11 X12 X13 X14 X15 X16 X17 X18
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
X19 X20 X21 X22 X23 X24 X25 X26 X27
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
X28 X29 X30
0.00000000 0.00000000 0.00000000
> LASSO2_Coeffs[[1]][["coefficients"]]
X1 X2 X3 X4 X5 X6 X7 X8
0 0.00000000 0.000000000 0.00000000 0.0000000 0.00000000 0.000000000 0.000000000 0.000000000
1 0.06450913 0.000000000 0.00000000 0.0000000 0.00000000 0.000000000 0.000000000 0.000000000
2 0.07198916 0.008349275 0.00000000 0.0000000 0.00000000 0.000000000 0.000000000 0.000000000
3 0.12366660 0.059560378 0.04936567 0.0000000 0.00000000 0.000000000 0.000000000 0.000000000
4 0.14511524 0.081678129 0.06892450 0.0000000 0.02097678 0.000000000 0.000000000 0.000000000
5 0.91860803 0.851771768 0.84108987 0.8804519 0.83171011 0.000000000 0.000000000 0.000000000
6 0.94120023 0.873851547 0.86377944 0.9080777 0.85608657 0.000000000 0.000000000 0.000000000
7 0.95081951 0.883359061 0.87335729 0.9205868 0.86614005 0.000000000 0.000000000 0.000000000
8 0.95697092 0.889083242 0.87974842 0.9283948 0.87281654 0.000000000 0.000000000 0.000000000
9 0.95782050 0.889802627 0.88061073 0.9294186 0.87368880 0.000000000 0.000000000 0.000000000
10 0.96887108 0.899273499 0.89153742 0.9436549 0.88563826 0.000000000 0.000000000 0.000000000
I have only included the first 10 rows for the first 8 factors; the full results stored in the coefficients component of the LASSO2_Coeffs object for each dataset run to 500 rows for all 30 candidate factors! That is as many rows as there are observations on the independent variables, in every single one of the 47,500 datasets.
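To make the difference between the two outputs above easier to reproduce, here is a minimal sketch on one small simulated dataset. The data, the sizes `n` and `p`, and the object names are illustrative assumptions, not my real datasets. The values `s = 0.1` and `mode = "fraction"` are simply the ones shown in the first output. As far as I can tell, calling `predict()` on an `enet` fit without `s` returns the coefficients at every step of the path, whereas supplying `s` returns a single set of coefficients.

```r
# Minimal sketch; the simulated data and all names here are hypothetical.
library(elasticnet)

set.seed(11)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("X", 1:p)))
Y <- 1.0 * X[, 1] + 0.5 * X[, 2] + rnorm(n)

# lambda = 0 makes enet() a plain LASSO fit, as in the code above
fit <- enet(x = X, y = Y, lambda = 0, normalize = FALSE)

# No `s` supplied: coefficients at every step of the path (a matrix)
full_path <- predict(fit, type = "coefficients")$coefficients

# `s` supplied: coefficients at one point on the path (a named vector)
one_row <- predict(fit, s = 0.1, mode = "fraction",
                   type = "coefficients")$coefficients

dim(full_path)   # rows = steps on the path, columns = candidate factors
length(one_row)  # one estimate per candidate factor
```

If this is the cause, the value of `s` still has to be chosen somehow (for example by cross-validation); 0.1 here is only a placeholder.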
If you would like to verify any of this by running the code yourself, you can find the script I ran it all with (called 'LASSO code (2)') and the file folder 'sample_obs2' in my GitHub account, within the repository called "Estimated-Exhaustive-Regression-Project".