0

trying to use the glmnetUtils package from GitHub for formula interface to glmnet but predict is not estimating enough values

library(nycflights13) # from GitHub
library(modelr)
library(dplyr)
library(glmnet)
library(glmnetUtils)
library(purrr)


fitfun=function(dF){
  cv.glmnet(arr_delay~distance+air_time+dep_time,data=dF)
}
gnetr2=function(model,datavals){
  yvar=all.vars(formula(model)[[2]])
  print(paste('y variable:',yvar))
  print('observations')
  print(str(as.data.frame(datavals)[[yvar]]))
  print('predictions')
  print(str(predict(object=model,newdata=datavals)))
  stats::cor(stats::predict(object=model, newdata=datavals), as.data.frame(datavals)[[yvar]], use='complete.obs')^2
}


flights %>% 
  group_by(carrier) %>% 
  do({
    crossv_mc(.,4) %>% 
      mutate(mdl=map(train,fitfun),
             r2=map2_dbl(mdl,test,gnetr2))
  })

the output from gnetr2():

[1] "y variable: arr_delay"
[1] "observations"
 num [1:3693] -33 -6 47 4 15 -5 45 16 0 NA ...
NULL
[1] "predictions"
 num [1:3476, 1] 8.22 21.75 24.31 -7.96 -7.27 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3476] "1" "2" "3" "4" ...
  ..$ : chr "1"
NULL
Error: incompatible dimensions

any ideas what's going on? your help is much appreciated!

Dominik
  • 782
  • 7
  • 27

2 Answers2

1

This is an issue with the underlying glmnet package, but there's no reason that it can't be handled in glmnetUtils. I've just pushed an update that should let you use the na.action argument with the predict method for formula-based calls.

  • Setting na.action=na.pass (the default) will pad out the predictions to include NAs for rows with missing values
  • na.action=na.omit or na.exclude will drop these rows

Note that the missingness of a given row may change depending on how much regularisation is done: if the NAs are for variables that get dropped from the model, then the row will be counted as being a complete case.

Also took the opportunity to fix a bug where the LHS of the formula contains an expression.

Give it a go with install_github("Hong-Revo/glmnetUtils") and tell me if anything breaks.

Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • the default for `predict.glm` is `na.action = na.pass`, which drops observations with NA. Shouldn't `na.exclude` pad the predictions? – Dominik Sep 16 '16 at 15:22
  • You've got it backwards: `na.pass` keeps rows, and `na.exclude` drops rows. In general `predict.lm` and `glm` are complicated, but the behaviour I've got there should match the most common case where you leave the defaults unchanged when fitting the model, and specify newdata when predicting. – Hong Ooi Sep 16 '16 at 22:15
  • i guess i don't understand the semantics of predict.cv.glmnet. The new version works fine by default. if na.action=na.exclude then the same error occurs. I thought the default of predict() was na.pass hence the confusion. Thanks for the update! – Dominik Sep 17 '16 at 16:00
  • Right, if your data has NAs, and you want to predict on it, then just leave everything set to the default. – Hong Ooi Sep 17 '16 at 16:13
  • Also, the default for `predict.lm` and `glm` IS na.pass. – Hong Ooi Sep 17 '16 at 16:32
  • and predict.cv.glmnet uses predict.glm under the hood, correct? – Dominik Sep 17 '16 at 16:33
  • No. The implementation has nothing to do with lm or glm. – Hong Ooi Sep 17 '16 at 16:35
  • ah, therein lies my misunderstanding. Thanks! – Dominik Sep 17 '16 at 16:55
  • If this has left you confused about how predict.lm and glm handle NAs, just experiment with them a little. You should find that na.pass, na.omit and na.exclude all work the way I describe above. (You might also want to see what happens with and without a newdata argument.) – Hong Ooi Sep 17 '16 at 17:02
0

Turns out its happening because there are NA in the predictor variables so predict() results in a shorter vector since na.action=na.exclude.

Normally a solution would be to use predict(object,newdata,na.action=na.pass) but predict.cv.glmnet does not accept other arguments to predict.

Therefore the solution is to filter for complete cases before beginning

flights=flights %>% filter(complete.cases(.))

Dominik
  • 782
  • 7
  • 27