I have run a Backward Elimination Stepwise Regression on each of 58,000 randomly generated synthetic datasets in sequence, then separated out and reformatted the output into the form I need: just the name of each csv-formatted dataset and the variables selected by the BE run on it. Now I need to use that output to quantify how many of those selected models are correct. The true underlying population/structural regression equation characterizing each dataset is known by construction because this is a Monte Carlo Simulation.
The following commands were run from my "Quantifying BE's performance" script in my GitHub Repository for this research project. I have stored the output in an object called BM2_models, which looks like this:
> BM2_models <- read.csv("IVs_Selected_by_BE (no headers).csv", header = FALSE)
> head(BM2_models, n = 5)
V1
1 0-3-1-1; X1, X2, X3, X4, X7, X18
2 0-3-1-2; X1, X2, X3, X7, X13, X16, X20
3 0-3-1-3; X1, X2, X3, X6, X11, X14, X21
4 0-3-1-4; X1, X2, X3, X4, X8, X10, X16, X17, X18, X24
5 0-3-1-5; X1, X2, X3, X8, X11, X14, X20, X24, X26, X29
> tail(BM2_models, n = 2)
V1
57999 1-15-9-499; X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X19, X20, X22, X23, X27
58000 1-15-9-500; X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X22, X23, X25, X26
> str(BM2_models)
'data.frame': 58000 obs. of 1 variable:
$ V1: chr "0-3-1-1; X1, X2, X3, X4, X7, X18" "0-3-1-2; X1, X2, X3, X7, X13, X16, X20" "0-3-1-3; X1, X2, X3, X6, X11, X14, X21" "0-3-1-4; X1, X2, X3, X4, X8, X10, X16, X17, X18, X24" ...
The n1-n2-n3-n4 strings before the semicolons are the names of the csv files, and what comes after each one is the set of variables selected (out of 30 candidate variables) by the Stepwise Regression run on the dataset in that csv file; what each n means is explained in a p.s. section at the bottom. The next step, which has so far proven impenetrable, is to count the number of models selected by BE which are correct. The problem is that I can't just sum up the correctly selected models in a straightforward manner by running:
n_df <- do.call(rbind.data.frame, lapply(strsplit(BM2_models$V1, ";"),
  function(x) {
    s <- strsplit(x, "-")  # x[1] = file name "n1-n2-n3-n4", x[2] = selected IVs
    c(s[[1]], s[[2]])
  })) |> setNames(c("n1", "n2", "n3", "n4", "IV"))
Then
CSM3 <- sum(sub_3_df$IV == " X1, X2, X3")
in order to count how many 3-factor models were selected correctly, because some of the selected models list the variables in a different order but are still correct. For instance, row 55 is "X2, X3, X1", which is still the correct model. So I need to figure out how to modify the simple comparison above so that it accepts every ordering of the first 3 factors.
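To make the goal concrete, here is a minimal sketch of the kind of order-insensitive comparison I have in mind, not tested on the full data. It splits each IV string into individual variable names and compares them to the target as a set. The helper name `is_exact_match` and the small `sub_3_df` built here are hypothetical and exist only for illustration.

```r
# Hypothetical stand-in for sub_3_df: IV strings in the same format as above
sub_3_df <- data.frame(
  IV = c(" X1, X2, X3", " X2, X3, X1", " X1, X2, X3, X7", " X1, X3"),
  stringsAsFactors = FALSE
)

# TRUE when the selected variables are exactly the target set, in any order
is_exact_match <- function(iv_string, target) {
  selected <- trimws(strsplit(iv_string, ",")[[1]])  # "X2, X3, X1" -> c("X2", "X3", "X1")
  setequal(selected, target)
}

matches <- vapply(sub_3_df$IV, is_exact_match, logical(1),
                  target = c("X1", "X2", "X3"), USE.NAMES = FALSE)

# Count of correctly selected 3-factor models, regardless of ordering
CSM3 <- sum(matches)
```

Because `setequal()` ignores ordering, "X2, X3, X1" counts as correct, while a model with an extra or a missing variable does not.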
p.s. n1 represents the degree of multicollinearity between the regressors in the true underlying model, n2 represents the number of variables k, n3 represents the error variance, and n4 is just a counter that runs from 1 to 500 for each combination of the other three, indexing the 500 different random datasets generated with those parameters.
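Since n2 is embedded in every file name, the same set comparison could perhaps be applied to all rows at once. The sketch below rests on an assumption I am inferring from the 3-factor case: that the true model for a dataset with n2 = k consists of exactly X1 through Xk. The `raw` vector is a made-up sample in the same format as BM2_models$V1, and all object names are hypothetical.

```r
# Hypothetical sample in the same "n1-n2-n3-n4; IVs" format as BM2_models$V1
raw <- c("0-3-1-1; X2, X3, X1",
         "0-3-1-2; X1, X2, X3, X7",
         "1-4-9-1; X4, X1, X3, X2")

parts <- strsplit(raw, ";")

# File name is the part before the semicolon; n2 is its second "-" field
name <- vapply(parts, `[`, character(1), 1)
k    <- as.integer(vapply(strsplit(name, "-"), `[`, character(1), 2))

# Selected variables as a character vector per row
chosen <- lapply(parts, function(p) trimws(strsplit(p[2], ",")[[1]]))

# ASSUMPTION: the true model for n2 = k is X1, ..., Xk
truth <- lapply(k, function(kk) paste0("X", seq_len(kk)))

# Order-insensitive exact match, row by row
correct <- mapply(setequal, chosen, truth)

# Number of correctly selected models for each value of n2
by_k <- tapply(correct, k, sum)
```

On the real data, `raw` would be replaced by BM2_models$V1, and `by_k` would hold one count per model size.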