3

Need to apply Branch and bound method to choose best model. leaps() from leaps package works well, only if the data has no NA values, otherwise throws an error:

#dummy data
x<-matrix(rnorm(100),ncol=4)
#convert to 0,1,2 - this is a genetic data, NA=NoCall
x<-matrix(round(runif(100)*10) %% 3,ncol=4)
#introduce NA=NoCall
x[1,1] <-NA
#response, case or control
y<-rep(c(0,1,1,0,1),5)
leaps(x,y)

Error in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = NCOL(x) + int,  : 
  NA/NaN/Inf in foreign function call (arg 4)

Using only complete.cases() is not an option as I lose 80% of data.

What is an alternative to leap that can handle NAs? I am writing my own function to do something similar, but it is getting big and clunky, I feel like I am reinventing the wheel...

UPDATE: I have tried using stepAIC(), facing the same problem of missing data:

Error in stepAIC(fit) : 
  number of rows in use has changed: remove missing values?
zx8754
  • 52,746
  • 12
  • 114
  • 209
  • would imputing the NA be reasonable for your data? and is `stepAIC()` (forward and back) really no good? – Stephen Henderson Nov 29 '13 at 11:55
  • Imputation is not an option. Does `stepAIC()` go through every possible combination? – zx8754 Nov 29 '13 at 12:02
  • 1
    no `MASS::stepAIC()` like `step()` does forward, back or both. Usually finds same model as `leaps()` for most datasets (I think???). – Stephen Henderson Nov 29 '13 at 12:54
  • 1
    OK, need a big picture clarification; how are you thinking to compare two models that are built on different data sets? Or is that part of the question? – Aaron left Stack Overflow Dec 02 '13 at 19:08
  • @Aaron with worry that this post turns into stats, I will give more details: with ~50 variables, intention is to build best possible model (prefer AIC, but can be anything). To calculate all possible combinations for 50 variables is not feasible, hence thought of using branch and bound. Ideally I need, `function(data, glm(outcome~all combinations), bootstrap=1000, return(best model(AIC)))` – zx8754 Dec 02 '13 at 20:44
  • 2
    This part of it, though, is a stats problem, as AIC can't compare models built with different data sets. So to compare models with and without certain variables, you need to remove the rows with missing values for those variables. You may need to "reconsider your modeling strategy", to quote [Ben Bolker](http://stackoverflow.com/a/11767674/210673). Otherwise you may also want to look into variants of AIC, a quick Google search brings up a [recent JASA article](http://dx.doi.org/10.1198/016214508000001057) that might be a good starting point. – Aaron left Stack Overflow Dec 02 '13 at 20:57

2 Answers2

1

you may try bestglm::bestglm where branch-bound method can be specified. The NAs can be handled by na.action argument as it in glm. see here for additional information: http://cran.r-project.org/web/packages/bestglm/vignettes/bestglm.pdf

David Z
  • 6,641
  • 11
  • 50
  • 101
  • Thanks for referring to `bestglm` package, didn't know about it. This package also requires elimination of `NA` rows, which I am trying to avoid. – zx8754 Dec 06 '13 at 11:58
0

This is a stats problem, as AIC can't compare models built with different data sets. So to compare models with and without certain variables, you need to remove the rows with missing values for those variables. You may need to "reconsider your modeling strategy", to quote Ben Bolker. Otherwise you may also want to look into variants of AIC, a quick Google search brings up a recent JASA article that might be a good starting point.

- Aaron

Community
  • 1
  • 1
zx8754
  • 52,746
  • 12
  • 114
  • 209
  • I am getting the exact same error while I am using `bestglm(train, IC="AIC")`. My dataframe has no null rather some zeros. – Learner Mar 23 '17 at 12:36