
I am interested in using cross validation (leave-one-out or K-folds) to test several different negative binomial GLMs that I have created. I am using the glm.nb() function from MASS to run negative binomial regression.

My question is whether I can use cv.glm() from boot to test this model. I am leaning towards no, but wondered if anyone knew of a function that would let me perform K-fold validation (or leave-one-out). Or perhaps cv.glm() is perfectly valid for negative binomial models.

Here is some data from an online example (http://www.ats.ucla.edu/stat/r/dae/nbreg.htm) to help out. I had thought that the cross-validation results ($delta) should be bounded between 0 and 1, but that is not the case below, which may be indicative of something gone awry.

I've found a few questions about interpreting the output of cv.glm(), but none specifically about how to do cross-validation with negative binomial models in R.

require(foreign)
require(ggplot2)
require(MASS)
require(boot)

dat <- read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
  prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})


summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))

# This code will run, but is it valid for negative binomial GLM models?
cv.glm(dat, m1, K = 10)
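If you would rather not rely on cv.glm(), a K-fold loop for glm.nb() is easy to write by hand and serves as a cross-check. A minimal sketch on simulated data (the names sim, x, y, folds and the simulation parameters are illustrative, not from the UCLA example):

```r
# Manual K-fold cross-validation for a negative binomial GLM.
# Simulated data so the sketch runs standalone; with the UCLA data you
# would substitute `dat` and the formula daysabs ~ math + prog.
library(MASS)

set.seed(1)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rnbinom(n, size = 1.5, mu = exp(1 + 0.5 * sim$x))

K <- 10
folds <- sample(rep(1:K, length.out = n))  # random fold assignment
mse <- numeric(K)
for (k in 1:K) {
  train <- sim[folds != k, ]
  test  <- sim[folds == k, ]
  fit   <- glm.nb(y ~ x, data = train)
  pred  <- predict(fit, newdata = test, type = "response")
  mse[k] <- mean((test$y - pred)^2)      # held-out squared error
}
mean(mse)  # comparable to cv.glm(...)$delta[1]
```

The average of the per-fold MSEs should be in the same ballpark as the first component of cv.glm()'s $delta.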

[I'm not sure if this question belongs here or on Cross Validated...]


1 Answer


Using cv.glm() is valid for any glm object. It does not use any updating formula to compute the cross-validation score; it simply drops a group of observations, fits the model to the reduced dataset, and computes the prediction error on the held-out group. This is repeated many times and the results are averaged. For this reason, leave-one-out cross-validation (the default) is much more costly than K-fold cross-validation.

Why is it valid for any glm object? How does it know what model it should fit? Well, you tell it by passing in your fitted model m1. Have a look at:

m1$call
#glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, 
#    link = log)

When cv.glm() drops data and refits the model on the reduced dataset, it reuses this call. So you are definitely fitting a negative binomial model every time.
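You can mimic this yourself: take the stored call, swap in a reduced dataset, and re-evaluate it (as far as I can tell, cv.glm() does essentially this via update()). A sketch on simulated data, with illustrative names:

```r
# Re-evaluating a fitted model's stored call on a reduced dataset,
# the way cv.glm() does when it drops a fold.
library(MASS)

set.seed(2)
d <- data.frame(x = rnorm(100))
d$y <- rnbinom(100, size = 2, mu = exp(0.5 + d$x))
m <- glm.nb(y ~ x, data = d)

call_k <- m$call
call_k$data <- d[-(1:10), ]  # drop a "fold" of 10 observations
m_k <- eval(call_k)          # refit on the reduced data

class(m_k)  # "negbin" "glm" "lm" -- still a negative binomial model
```

The refit inherits everything from the original call, including the log link, so every fold is fit with glm.nb() rather than a plain glm().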


> I am trying to interpret the results from cv.glm()$delta. In the example above, the model is predicting "Days absent from school", and the output of $delta is ~42.1. Would I interpret this as "on average, this model has a prediction error of 42 days"?

By default, cv.glm() returns MSE, defined by the cost argument:

cost = function(y, yhat) mean((y - yhat)^2)

So 42.1 is really a mean squared prediction error, i.e., on the variance scale. You need the RMSE, sqrt(42.1) ≈ 6.5, for a prediction error / standard deviation. That has the same unit as your response variable.
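For example (simulated data, since the point is only the unit conversion; names are illustrative):

```r
# cv.glm()'s delta is an MSE; take its square root for an error in the
# same units as the response.
library(MASS)
library(boot)

set.seed(3)
d <- data.frame(x = rnorm(150))
d$y <- rnbinom(150, size = 1.2, mu = exp(1 + 0.4 * d$x))
m <- glm.nb(y ~ x, data = d)

cv <- cv.glm(d, m, K = 10)
cv$delta[1]        # raw CV estimate: mean squared prediction error
sqrt(cv$delta[1])  # RMSE, in the same units as y
```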

  • As a follow-up: I am trying to interpret the results from the cv.glm() output. Between [Wikipedia](https://en.wikipedia.org/wiki/Mean_squared_error) and this helpful [r-tutorial](https://uw-pols503.github.io/pols_503_sp15/docs/Life_Expectancy_Example.html), it appears as though cv.glm()$delta is either MSE or RMSE and the units are the same as the predicted variable. In the example above, the model is predicting "Days absent from school", and the output of cv.glm()$delta is ~42.1. Would I interpret this as "on average, this model has a prediction error of 42 days"? – Kodiakflds Jul 08 '16 at 19:17