I created a list of linear models after performing k-fold cross validation. When I then use

map(modList, ~ stepwise(., direction = "backward", criterion = "AIC"))

I get this error:

Error in stepAIC(mod, scope = list(lower = lower, upper = upper), direction = direction, : number of rows in use has changed: remove missing values?

I found advice [here][1] and although I didn't have any NA values to begin with, I used na.omit() when I created each model:

mod <- lm(z_mara ~ . -1, data = na.omit(trainCV_sets[[i]]))
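A stricter check than na.omit() would be na.action = na.fail, which errors immediately if any rows would be dropped; since it fits the same model otherwise, a clean fit here should rule out missing values as the cause (a sketch):

```r
# errors right away if the fold contains any NAs, instead of
# silently dropping rows during one of stepwise()'s refits
mod <- lm(z_mara ~ . - 1, data = trainCV_sets[[i]], na.action = na.fail)
```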

What's curious is that if I use stepwise() on each individual model within my modList, I only get the error on the first through third models:

stepwise(modList[[1]], direction = "backward", criterion = "AIC")

The fourth model in the list works just fine (e.g., change [[1]] to [[4]]). The only difference I've found so far between the models that fail and the one that works is the number of rows used to fit them: the 1st-3rd models were fit on 11 rows, while the 4th was fit on 12. I tried changing the number of folds from 4 to 2 and was able to reproduce a similar issue, where stepwise() produces the error for some, but not all, models.

I expected that the rows wouldn't matter, and I'm currently stuck on what to do next. I don't know of a way to avoid having unequal numbers of rows after cross validation because the data won't usually be divisible by whatever number of folds I choose. Thanks!
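For what it's worth, the uneven training-fold sizes are easy to see directly (a sketch, using the trainCV_sets list from the reproducible code below):

```r
# with 15 rows and k = 4, the test folds have 4, 4, 4, and 3 rows,
# so the training folds have 11, 11, 11, and 12 rows respectively
sapply(trainCV_sets, nrow)
```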

**EDIT:** Below is a test dataset and code you can use:

training_pca <- tibble(z_mara = c(0.875112399579427, 1.13894237770963, 
0.981180086218575, 1.8299894270253, -0.0977104233331723, -0.86241558837047, 
-1.40078945237221, 0.030856469562766, -0.0969068802525719, -0.223063143906711, 
-0.339309042900122, -0.760633464827852, -1.01830294600679, -1.29686454728133, 
-1.39007554463088), PC1 = c(2.50796906215996, 2.68002216940135, 
-3.71526007848766, -4.1866795459591, 0.0486434530855795, 0.938304238296873, 
0.615097559709769, 4.51822579066407, 0.0667126218105262, -0.287378553280471, 
4.56603423346247, -0.933301952142784, -0.216058021757341, 0.0694218190496208, 
9.2292656007433), PC2 = c(2.17078441684463, -0.199721476838284, 
2.11735035612631, 3.03544498625801, -2.40626489140099, 2.35791076587204, 
-1.25582416436226, 1.05994935645006, -0.527916055713799, 0.070729476106444, 
-0.132485078109658, 0.949419002097709, -5.09998831832504, 1.0619915074115, 
2.3100848754419), PC3 = c(-0.678144405424994, 1.09058577380406, 
-1.0176297169444, -1.88552787304444, -1.25338902691145, 0.210499981706639, 
0.04477690001186, -1.38969813324165, -0.804920509008458, 0.879138015029564, 
1.29938967213637, 0.138188073442918, -5.26695126718185, -2.14797795807358, 
1.62335052575001), PC4 = c(-2.3887668089524, 3.97903517088607, 
0.867800635964586, -2.03442689569174, 0.13869948487601, -0.0664645238981098, 
0.670700717048399, 3.11260554952338, 0.408133658723152, 3.44132931097882, 
1.10549707214258, 1.0954984094734, -0.49754510472686, -2.85815085785707, 
2.04309500230981), PC5 = c(0.821714377740757, 3.78554017369849, 
0.230169453679477, -1.9039365507055, 0.0828762592235243, -1.83817906200712, 
-2.4669972669382, 1.88139312797292, -1.28458814157848, -1.74987295529079, 
0.342015969239523, -0.819381168596516, 1.64847329083668, -3.95311876523071, 
-0.306975624219868))

set.seed(3654)
trainingCV_pca <- crossv_kfold(training_pca, k = 4)

trainCV_sets <- vector(mode = "list", length = nrow(trainingCV_pca))
testCV_sets <- vector(mode = "list", length = nrow(trainingCV_pca))

for(i in 1:nrow(trainingCV_pca)) {
  trainCV_set_name <- paste("trainCV", i, sep = "")
  trainCV_idx <- trainingCV_pca$train[[i]]$idx # row indices for the ith training fold
  trainCV_set <- training_pca[trainCV_idx,]
  assign(trainCV_set_name, trainCV_set)
  trainCV_sets[[i]] <- trainCV_set
  
  testCV_set_name <- paste("testCV", i, sep = "")
  testCV_set <- training_pca[-trainCV_idx,]
  assign(testCV_set_name, testCV_set)
  testCV_sets[[i]] <- testCV_set
}
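(As an aside, the two lists above can be built more compactly, since modelr resample objects convert directly with as.data.frame(); a sketch, skipping the assign() calls:)

```r
library(purrr)

# each element of trainingCV_pca$train / $test is a modelr resample;
# as.data.frame() materializes the corresponding rows of training_pca
trainCV_sets <- map(trainingCV_pca$train, as.data.frame)
testCV_sets  <- map(trainingCV_pca$test, as.data.frame)
```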

modList <- vector(mode = "list", length = nrow(trainingCV_pca))

for(i in 1:nrow(trainingCV_pca)) {
  train_mod_name <- paste("fullTrainMod", i, sep = "")
  mod <- lm(z_mara ~ . -1, data = trainCV_sets[[i]]) # centered and scaled so no intercept
  modList[[i]] <- assign(train_mod_name, mod)
}
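(The model list can likewise be built with a single map() call, equivalent to the loop above minus the assign() side effects; a sketch:)

```r
library(purrr)

# fit one no-intercept model per training fold
modList <- map(trainCV_sets, ~ lm(z_mara ~ . - 1, data = .x))
```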

map(modList, ~ stepwise(., direction = "backward", criterion = "AIC"))


  [1]: https://stackoverflow.com/questions/11819472/why-does-the-number-of-rows-change-during-aic-in-r-how-to-ensure-that-this-does
a.hesse
    this is a reasonable question, but it's sufficiently challenging to replicate (go find code which is hidden behind "Github enterprise/sign in via LDAP") that I probably won't try to answer it ... – Ben Bolker Aug 31 '21 at 19:51
  • I didn't think about the enterprise restriction. I'll try to update the post and/or links to make it easier to reproduce. – a.hesse Sep 01 '21 at 20:14
  • even so, it would help a great deal if you can generate a **much simpler** [mcve] ... – Ben Bolker Sep 01 '21 at 20:41
  • I've made updates that should allow you to reproduce the example much more easily. If there are still errors, let me know. – a.hesse Sep 07 '21 at 16:07

1 Answer

I can't offer much more than a wild guess, but it's probably some sort of irregularity in the data. If you have factor/categorical data rather than purely numeric columns, I'd be very suspicious of factor levels that occur only a few times. Also check for constant variables (e.g. '0' for every row). These can cause singularities in the fitted models, so the number of estimated parameters is smaller than you'd expect, and the number of usable rows can change between refits. Similarly, make sure everything you think is numeric really is numeric; sometimes something goes weird upstream and "1.036" ends up stored as a string or a factor.
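For example, these quick checks would flag constant or non-numeric columns in each fold (a sketch, assuming the fold data frames are in trainCV_sets as in your question):

```r
# columns that never vary within a fold (constant, hence singular in lm)
lapply(trainCV_sets, function(d) names(d)[sapply(d, function(x) length(unique(x)) <= 1)])

# columns that are not numeric (e.g. accidentally read as character/factor)
lapply(trainCV_sets, function(d) names(d)[!sapply(d, is.numeric)])
```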

I wish I could help more. I certainly understand the frustrations of an obscure error coming out of a package you don't have the code for. You might try to see if you can find the source code somehow.