Counting the number of correct regression models selected by FS Stepwise Regression in R run on N datasets

Question

I need to count how many of the models selected by a Forward Selection Stepwise Regression that I ran on 58,500 different datasets are correctly specified (that is, match the true underlying/structural regression model for their corresponding dataset).

The following commands were run from my "Quantifying FS's performance" script in my GitHub repository for this research project. The results that script begins by loading were generated using the "Both BE and FS script". The folders containing the 58,500 csv-formatted datasets that script is run on can be found here and here (but you'll have to copy-paste the 500 from the 2nd into the 1st locally).

I have stored the output in an object called BM3_models, which looks like this:

> head(BM3_models, n = 4)
                                                                   V1
1                 0-3-1-1;  X3, X1, X2, X4, X18, X7                  
2            0-3-1-2;  X3, X1, X2, X13, X7, X16, X20                 
3            0-3-1-3;  X1, X2, X3, X11, X21, X6, X14                 
4 0-3-1-4;  X2, X3, X1, X10, X17, X18, X24, X4, X8, X16              
> tail(BM3_models, n = 2)
                                                                                                                 V1
58499 1-15-9-499;  X1, X8, X11, X10, X15, X2, X12, X6, X7, X13, X5, X4, X3, X9, X14, X20, X27, X22, X19, X23    
58500 1-15-9-500;  X1, X15, X6, X8, X14, X3, X5, X7, X11, X10, X9, X4, X2, X12, X13, X22, X26, X25, X23, X29 

> str(BM3_models)
'data.frame':   58500 obs. of  1 variable:
 $ V1: chr  "0-3-1-1;  X3, X1, X2, X4, X18, X7                  " "0-3-1-2;  X3, X1, X2, X13, X7, X16, X20                 " "0-3-1-3;  X1, X2, X3, X11, X21, X6, X14                 " "0-3-1-4;  X2, X3, X1, X10, X17, X18, X24, X4, X8, X16              " ...

I ran the following to separate out all the 4,500 models selected by FS for each N-Factor case from 3, 4, 5, ..., 15:

n_df <- do.call(rbind.data.frame, lapply(strsplit(BM3_models$V1, ";"), function(x) {
  s <- strsplit(x, "-")   # s[[1]]: the four file-name parts, s[[2]]: the selected variables
  c(s[[1]], s[[2]])
})) |> setNames(c("n1", "n2", "n3", "n4", "IV"))
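For anyone who wants to reproduce the split without the full data, here is a minimal base R sketch of the same parsing step. The object names toy and n_df_toy are hypothetical, the four rows are copied from the head() output above, and unlike the code above it also trims the padding around the variable list.

```r
# Toy stand-in for BM3_models: the four rows shown by head() above.
toy <- data.frame(V1 = c(
  "0-3-1-1;  X3, X1, X2, X4, X18, X7                  ",
  "0-3-1-2;  X3, X1, X2, X13, X7, X16, X20                 ",
  "0-3-1-3;  X1, X2, X3, X11, X21, X6, X14                 ",
  "0-3-1-4;  X2, X3, X1, X10, X17, X18, X24, X4, X8, X16              "
))

# Split each string at the semicolon: element 1 is the file name,
# element 2 is the list of selected variables.
parts <- strsplit(toy$V1, ";", fixed = TRUE)

# Split the file name at the hyphens to get the four n1-n2-n3-n4 parts.
ids <- strsplit(vapply(parts, `[`, character(1), 1), "-", fixed = TRUE)

# Bind the id parts into four columns and attach the trimmed variable list.
n_df_toy <- data.frame(
  do.call(rbind, ids),
  IV = trimws(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
names(n_df_toy) <- c("n1", "n2", "n3", "n4", "IV")

print(n_df_toy)
```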

Then I also ran the following proposed solution separately for each of the cases 3 through 15 (the 3 in sub_3_df refers to the 3-factor case):

library(stringr)  # provides str_split() and str_trim()

sub_3_df <- subset(n_df, n2 == "3")
CSM3 <- sum(sort(str_split(str_trim(sub_3_df$IV), ", ?")[[1]]) == sort(str_split(str_trim("  X1, X2, X3"), ", ?")[[1]]))
> print(CSM3)
[1] 1

sub_4_df <- subset(n_df, n2 == "4")
CSM4 <- sum(sort(str_split(str_trim(sub_4_df$IV), ", ?")[[1]]) == sort(str_split(str_trim("  X1, X2, X3, X4"), ", ?")[[1]]))
> print(CSM4)
[1] 2

Where:

> head(sub_3_df, n = 3)
  n1 n2 n3 n4                                               IV
1  0  3  1  1        X3, X1, X2, X4, X18, X7                  
2  0  3  1  2   X3, X1, X2, X13, X7, X16, X20                 
3  0  3  1  3   X1, X2, X3, X11, X21, X6, X14

The code above was offered as a potential solution to a previous question I asked here on Stack Overflow, which was about counting the number of correctly selected models for 58.5K Backward Elimination Stepwise Regressions run on the same datasets. It does run, but the number of correct models it returns is too small. For example, CSM4 returns 2 when printed, but FS makes at least 3 correct model selections across the datasets with true 4-factor underlying models; the first 3 are in rows 1568, 1663, and 1784 respectively (screenshots of these 3 can also be found in the GitHub repository).
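A likely reason the counts are so small, as far as I can tell: str_split() returns one list element per row, so the [[1]] keeps only the first row of the subset, and == then compares that one sorted row position by position against the recycled target. The sketch below replays this on the first 3-factor row shown above, using base strsplit() in place of str_split() so that it runs without stringr; the names first_row, selected and target are only for illustration.

```r
# First row of sub_3_df, as shown in the head() output above.
first_row <- "X3, X1, X2, X4, X18, X7"

# What the [[1]] in the original expression keeps: this one row only.
selected <- sort(strsplit(first_row, ", ?")[[1]])
target   <- sort(c("X1", "X2", "X3"))

# Element-wise comparison; the 3-element target is recycled across
# the 6 selected variables, so this is not a model-level comparison.
position_matches <- selected == target
print(position_matches)

# The sum counts matching positions within one row,
# not the number of correctly selected models.
print(sum(position_matches))
```

On this row the sum is 1, the same value as the printed CSM3, which is consistent with the expression never looking past the first row.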

P.S. The n1-n2-n3-n4 strings before the semicolons are the names of the csv files, and what follows each one is the set of variables selected (out of 30 candidate variables) by the Stepwise Regression run on the dataset in that file. To count how many 3-factor models were selected correctly, every ordering of the correct variables must be counted, because some of the selected models list them in a different order but are still correct. For instance, row 55 is "X2, X3, X1", which is still correct. So I need to figure out how to modify the simple function above to accommodate all orderings of the first 3 factors.
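To make the requirement concrete, here is a rough sketch of the kind of order-insensitive, row-by-row check I have in mind. It assumes that the true k-factor model is exactly X1 through Xk, with no extra variables, which is how I read my own examples. The function name count_correct and the vector toy_iv are hypothetical; the second toy string is row 55 quoted above and the third is made up for illustration.

```r
# Count the rows whose selected variables are exactly X1..Xk, in any order.
# Assumption: the true k-factor model consists of X1, ..., Xk and nothing else.
count_correct <- function(iv, k) {
  target <- paste0("X", seq_len(k))
  # One character vector of variable names per row, padding removed.
  selected <- strsplit(trimws(iv), ",\\s*")
  # setequal() ignores order, so "X2, X3, X1" matches X1, X2, X3.
  sum(vapply(selected, function(s) setequal(s, target), logical(1)))
}

toy_iv <- c(
  "X3, X1, X2, X4, X18, X7                  ",  # extra variables: not correct
  "X2, X3, X1",                                  # row 55: correct, different order
  "  X1, X2, X3  "                               # made-up row: correct
)
print(count_correct(toy_iv, 3))

# With the real data this could be applied to every case at once, e.g.:
# sapply(3:15, function(k) count_correct(n_df$IV[n_df$n2 == as.character(k)], k))
```

On the three toy rows this counts 2 correct models. Whether over-specified models such as the first toy row should count as wrong is my assumption here, not something the code decides on its own.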
