I have a train/test routine set up in R that takes a data set, holds out a portion of it (to guard against the model overfitting the data), trains a linear model on about half of the remaining data, and then tests the model on the other half.

I should note that the data set consists of PCA scores, which is why the linear model includes the seven PCA components.

library(caret)   # createDataPartition() comes from the caret package

splitprob = 0.7
trainindex = createDataPartition(scores$y, p = splitprob, list = FALSE)
trainingset = scores[trainindex, ]
testset = scores[-trainindex, ]
model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data = trainingset)
summary(model)
prediction = predict(model, testset, se.fit = TRUE)   # evaluate on the held-out half

Now, what I want to do is run this script multiple times, produce multiple models, and then pick one or more models that will be used to make future predictions. Although I have set up the function to be run a certain number of times, I do not know how to set it up so that I can compare the different models against one another (probably by using the AIC), nor am I sure how I should capture parameters of the model(s) in order to export them to a text file or a .csv file.
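To make it more concrete, something along these lines is roughly what I have in mind for the export step (just a rough sketch; `models` here stands for a hypothetical list of fitted glm objects collected over the repeated runs):

# 'models' is a hypothetical list of glm fits, one per run
model_stats = do.call(rbind, lapply(seq_along(models), function(i) {
  data.frame(run = i,
             aic = AIC(models[[i]]),          # AIC of the i-th model
             t(coef(models[[i]])))            # intercept and PC1-PC7 slopes as columns
}))
write.csv(model_stats, "model_stats.csv", row.names = FALSE)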

I tried implementing the glmulti package, but because of various problems with Java, rJava, and Mac OS X, I have had massive problems getting it to install properly. Could anyone recommend another approach to this problem?

Bob McBobson

1 Answer


I updated the answer so that the function can take a different set of predictors each time.

# build data

set.seed(1)                                        # for reproducibility
y = runif(100)                                     # response
x = matrix(runif(1000), nrow = 100, ncol = 10)     # 10 candidate predictors (X1 ... X10)

A function for repeated train/test splits that returns summary statistics over all iterations (it takes the response and predictors, the number of iterations, a model formula, and the train/test split percentage):

# requires the 'Metrics' package for the rmse() helper
glm_train_test = function(y, x, num_iters, Formula, split_percentage = 0.5) {

  aic_models  = matrix(0, nrow = num_iters, ncol = 2)
  rmse_models = matrix(0, nrow = num_iters, ncol = 2)

  for (i in 1:num_iters) {
    # train-test split
    spl   = sample(nrow(x), round(nrow(x) * split_percentage), replace = FALSE)
    train = data.frame(y = y[spl],  x[spl, ])
    test  = data.frame(y = y[-spl], x[-spl, ])

    tmp_formula = as.formula(Formula)
    fit = glm(formula = tmp_formula, data = train)

    # pred_train = predict(fit, newdata = train)
    pred_test = predict(fit, newdata = test)

    aic_models[i, ]  = c(i, summary(fit)$aic)
    rmse_models[i, ] = c(i, Metrics::rmse(y[-spl], pred_test))
  }

  # convert the resulting aic and rmse matrices to data frames
  sort_aic = as.data.frame(aic_models)
  colnames(sort_aic) = c('iteration', 'aic_value')

  sort_rmse = as.data.frame(rmse_models)
  colnames(sort_rmse) = c('iteration', 'rmse_value')

  # summary statistics (plus standard deviation) across all iterations
  tmp_aic = c(summary(sort_aic[, 2]), sd(sort_aic[, 2]))
  names(tmp_aic) = c(names(summary(sort_aic[, 2])), 'std.dev')

  tmp_rmse = c(summary(sort_rmse[, 2]), sd(sort_rmse[, 2]))
  names(tmp_rmse) = c(names(summary(sort_rmse[, 2])), 'std.dev')

  return(list(summary_aic = tmp_aic, summary_rmse = tmp_rmse))
}

First model, including only three predictors:

first_mod = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 + X2 + X3", split_percentage = 0.5)

Example output:

first_mod

$summary_aic
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.   std.dev 
-6.193321 10.527612 13.317441 13.320968 17.019571 26.792904  5.798970 

$summary_rmse
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.    std.dev 
0.23862643 0.26628405 0.27568030 0.27713722 0.28616462 0.33223873 0.01730974 

Second model, which also includes an interaction term:

second_model = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 * X4 + X2 + X3", split_percentage = 0.5)

Example output for the second model:

second_model

$summary_aic
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max.     std.dev 
-0.04232767 13.37572489 16.92720680 16.81206625 20.79921756 29.67830243  5.96477080 

$summary_rmse
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.    std.dev 
0.23912727 0.27026791 0.28117878 0.28177088 0.29526944 0.32674985 0.01819308

You can use an evaluation metric that is appropriate for your data, and you can also have a look at the difference between AIC and BIC, which is discussed in another Stack Overflow question.
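For instance, both criteria can be read directly off a fitted glm with base R's AIC() and BIC() helpers; a quick sketch using the example data built above:

# AIC and BIC of a single fit on the example data
fit = glm(y ~ X1 + X2 + X3, data = data.frame(y = y, x))

AIC(fit)   # Akaike information criterion (same value as summary(fit)$aic)
BIC(fit)   # Bayesian information criterion; penalizes model size more heavily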

lampros
  • Thank you so much! But out of curiosity, what should I set my AIC value to be? Should I start low and just keep increasing it until my output becomes more limited? Also, I am having a little bit of trouble understanding the format of the output as I am new to R... how do the numbers in the lists pertain to the values of the slopes of the variables or intercepts in the respective linear model? – Bob McBobson Aug 14 '17 at 10:23
  • @BobMcBobson, I updated/modified my answer to also return the n-best model fits. – lampros Aug 14 '17 at 15:45
  • Thanks for your informative update. I must also ask, after doing some digging into RMSE, how should I interpret models with the lowest RMSE that somehow do not correspond to models with the lowest AIC/BIC? I am having difficulty locating resources that discuss the nuances between these three criteria. – Bob McBobson Aug 15 '17 at 16:40
  • AIC and BIC belong to [model selection](https://en.wikipedia.org/wiki/Model_selection) criteria. Moreover, the [glmulti](https://www.jstatsoft.org/article/view/v034i12) package that you mentioned in the question utilizes both of these criteria among others (see page 11 of the web link). There are many resources on the web on model selection. I added RMSE to illustrate the case where you would use your own evaluation metric with the test data. – lampros Aug 16 '17 at 11:47
  • Aha I understand. I guess then my final question is how many models I should build with the function. I have tried to find resources to figure out what would be the optimal number, but I get various answers with high levels of complexity. Furthermore, does the line `fit = glm(y~., data = train)` create models with all the parameters specified previously, or is the period merely a placeholder? – Bob McBobson Aug 16 '17 at 13:37
  • I changed 'num_models' to 'num_iters' so that it's not confusing. To decide which model is the best for your data you should run the repeated train-test-splits (another resampling method would be cross-validation) for different parameters (complexity) of your model. *y~.* includes all the predictors of the data and corresponds to *y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7* in your data. – lampros Aug 16 '17 at 15:52
  • Do you know how it would be possible to change which predictors are included in each iteration so that many different models can be explored? – Bob McBobson Aug 16 '17 at 16:11
  • I modified the function and now it takes a formula of different predictors. You can compare the output summary statistics for each predictor combination (a sketch of such a comparison loop is shown below). – lampros Aug 16 '17 at 17:28
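A minimal sketch of such a comparison loop, assuming the glm_train_test() function and the example y and x defined above (y ~ . is the shorthand for using all available predictors):

# candidate formulas to compare (y ~ . uses all predictors in the training data)
candidate_formulas = c("y ~ X1 + X2 + X3",
                       "y ~ X1 * X4 + X2 + X3",
                       "y ~ .")

results = lapply(candidate_formulas, function(f) {
  glm_train_test(y, x, num_iters = 100, Formula = f, split_percentage = 0.5)
})
names(results) = candidate_formulas

# mean AIC and mean RMSE per formula, side by side
comparison = data.frame(
  formula   = candidate_formulas,
  mean_aic  = sapply(results, function(r) r$summary_aic[["Mean"]]),
  mean_rmse = sapply(results, function(r) r$summary_rmse[["Mean"]])
)
comparison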