I have run Backward Elimination (BE) Stepwise Regression sequentially on 58,000 randomly generated synthetic datasets, then separated out and reformatted the output into the form I need: just the name of each csv-formatted dataset and the variables BE selected for it. Now I need to use that output to quantify how many of the selected models are correct. The true underlying population/structural regression equation for each dataset is known by construction, because this is a Monte Carlo simulation.

The following commands were run from my "Quantifying BE's performance" script in my GitHub repository for this research project. I have stored the output in an object called BM2_models, which looks like this:

> BM2_models <- read.csv("IVs_Selected_by_BE (no headers).csv", header = FALSE)
> head(BM2_models, n = 5)
                                                      V1
1                      0-3-1-1;  X1, X2, X3, X4, X7, X18
2                0-3-1-2;  X1, X2, X3, X7, X13, X16, X20
3                0-3-1-3;  X1, X2, X3, X6, X11, X14, X21
4  0-3-1-4;  X1, X2, X3, X4, X8, X10, X16, X17, X18, X24
5 0-3-1-5;  X1, X2, X3, X8, X11, X14, X20, X24, X26, X29

> tail(BM2_models, n = 2)
                                                                                                          V1
57999 1-15-9-499;  X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X19, X20, X22, X23, X27
58000          1-15-9-500;  X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X22, X23, X25, X26

> str(BM2_models)
'data.frame':   58000 obs. of  1 variable:
 $ V1: chr  "0-3-1-1;  X1, X2, X3, X4, X7, X18" "0-3-1-2;  X1, X2, X3, X7, X13, X16, X20" "0-3-1-3;  X1, X2, X3, X6, X11, X14, X21" "0-3-1-4;  X1, X2, X3, X4, X8, X10, X16, X17, X18, X24" ...

The n1-n2-n3-n4 before each semicolon is the name of a csv file, and what comes after it are the variables selected (out of 30 candidate variables) by the Stepwise Regression run on the dataset in that file; what each n means is explained in the p.s. at the bottom. The next step, which I have so far been unable to get past, is to count how many of the models BE selected are correct. The problem is that I can't just sum up the correctly selected models in a straightforward manner by running:

    n_df <- do.call(rbind.data.frame, lapply(strsplit(BM2_models$V1, ";"),
        function(x) { s <- strsplit(x, "-"); c(s[[1]], s[[2]]) })) |>
        setNames(c("n1", "n2", "n3", "n4", "IV"))

Then

CSM3 <- sum(sub_3_df$IV == "  X1, X2, X3")

in order to count how many 3-Factor Models were selected correctly, because some of the selected models list their variables in a different order but are still correct. For instance, row 55 is "X2, X3, X1", but that is still correct. So I need to figure out how to modify the simple function above to accommodate every ordering of the first 3 factors.

p.s. n1 represents the degree of multicollinearity between the regressors in the true underlying model, n2 represents the number of variables k, n3 represents the error variance, and n4 is just a counter that runs from 1 to 500 for each combination of the other three, indexing the 500 random datasets generated for those parameter values.

Marlen

1 Answer

How about:

library(stringr)
target <- sort(str_split(str_trim("  X1, X2, X3"), ", ?")[[1]])
CSM3 <- sum(vapply(str_split(str_trim(sub_3_df$IV), ", ?"), function(iv) identical(sort(iv), target), logical(1)))

This syntax might be slightly wrong, but the idea is to trim and split each string, sort the pieces, and compare the sorted results.
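To illustrate the sort-and-compare idea on made-up data, here is a minimal sketch in base R. The helper name is_correct_model and the toy_iv vector are hypothetical, and the sketch assumes the true k-factor model is always X1 through Xk, which the question only states for k = 3.

```r
# Hypothetical helper (not from the original post): TRUE when a
# selected-variable string holds exactly X1..Xk, in any order.
# Assumption: the true k-factor model is always X1, ..., Xk.
is_correct_model <- function(iv, k) {
  # Split on commas, then strip the surrounding whitespace from each name
  selected <- trimws(strsplit(iv, ",", fixed = TRUE)[[1]])
  # Same set of names and no extras or duplicates
  length(selected) == k && setequal(selected, paste0("X", seq_len(k)))
}

# Made-up rows in the same shape as sub_3_df$IV
toy_iv <- c("  X1, X2, X3", "  X2, X3, X1", "  X1, X2, X3, X7", "  X1, X2")

# Apply the check to every row, then count the TRUEs
correct <- vapply(toy_iv, is_correct_model, logical(1), k = 3, USE.NAMES = FALSE)
CSM3_toy <- sum(correct)
print(correct)
print(CSM3_toy)
```

Only the first two toy rows count as correct here: the third selects an extra variable and the fourth omits one. Applying the check row by row is what avoids comparing just the first element of the split list.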

dcsuka
  • I'll try it out and let you know how it goes right now! Thanks for the suggestion, I fully understand how tricky this question is. I think with the help of my collaborator, we finally figured out how to do this in Excel yesterday, but the number looks WAY too small to pass the smell test, so hopefully I get a more plausible count with this method in R. – Marlen Oct 12 '22 at 12:30
  • Okay, so your proposed solution does run successfully for each sub_n_df from 3 to 15. However, for about half of them it gives the following warning (they still run anyway): Warning message: In sort(str_split(str_trim(sub_7_df$IV), ", ?")[[1]]) == sort(str_split(str_trim(" X1, X2, X3"), : longer object length is not a multiple of shorter object length. More importantly, the sums they return are too small: the largest any of them returned was 2, and most returned only 1. For instance, CSM5 returns 1, but I scrolled through sub_5_df and counted at least 4. – Marlen Oct 12 '22 at 13:01
  • How about my edits? – dcsuka Oct 12 '22 at 18:33
  • You might have to Map, mapply, or map2 these functions onto the sub_3_df$IV if necessary. – dcsuka Oct 12 '22 at 18:41