Assessing model accuracy is reasonably easy with Bernoulli outcomes, but I am unsure how to generate meaningful predictions from an aggregated binomial regression.

Take this example. We want to model the number of drug counselling sessions (variable numCouns) a client attends over a twelve-week period based on: (1) how many years they had been using cannabis regularly prior to starting treatment (variable durationRegUse) and (2) the number of grams of cannabis they used on an average day (variable gms). The maximum number of counselling sessions each client can attend is six.

Here is the data

df <- data.frame(durationRegUse = c(19, 9, 13, 19, 10, 13, 2, 14, 11, 12, 7, 6, 3, 18, 17, 9, 9, 10, 0, 20, 4, 4, 8, 5, 4, 19, 25, 10, 27, 1, 10, 25, 8, 24, 8, 18, 15, 10, 6, 14, 16, 13, 4, 4, 5, 17, 13, 21, 8, 7, 10, 17, 13, 12, 28, 38, 23, 19, 36, 3, 14, 14, 22, 11, 26, 17, 4, 8, 25, 35, 14, 28, 32, 29, 22, 21, 2, 23, 35, 34, 31, 34, 15, 14, 26, 6, 3, 25, 24, 31, 31, 27, 30, 14.5, 12, 9, 3, 13, 5, 6, 23, 21, 27, 7, 36, 19, 22, 15, 11, 17, 11, 26, 21, 15),
                 gms = c(3.5, 2, 0.5, 10, 3, 3, 4, 4, 2, 2, 2, 2, 2, 2, 1, 1.75, 4, 1.75, 0.33, 5, 2.5, 1.25, 1, 0.5, 3, 2, 5, 3, 3, 0.571, 1, 0.5, 2, 4, 2.5, 1.25, 1.5, 1, 2.5, 2, 1, 2, 1.5, 2, 0.2, 1, 1, 2, 14, 2, 3.5, 3, 2, 1.75, 2, 0.55, 1, 2, 6, 0.5, 0.5, 0.5, 3, 1, 2.75, 4.5, 3, 3, 3, 2, 2, 1, 2.5, 1.75, 1, 1.5, 2, 0.7, 7, 0.5, 2, 1.2, 0.4, 3, 0.8, 1.3, 1.2, 2, 1.5, 3, 2, 2, 4, 3, 1, 6, 1, 0.5, 1.5, 2.5, 1, 2.5, 1.5, 1, 1.5, 2.5, 1.5, 2.5, 10, 1.5, 1.5, 0.5, 5, 1.5),
                 numCouns = c(6, 1, 2, 6, 0, 6, 0, 0, 2, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 2, 5, 6, 0, 0, 6, 0, 6, 3, 6, 0, 0, 0, 4, 5, 0, 0, 4, 0, 4, 3, 0, 1, 2, 6, 4, 2, 4, 3, 1, 0, 2, 2, 5, 2, 0, 1, 3, 0, 3, 2, 1, 6, 0, 0, 1, 0, 1, 2, 0, 0, 5, 1, 1, 1, 5, 3, 5, 6, 6, 5, 3, 6, 2, 4, 3, 4, 6, 1, 0, 6, 4, 3, 3, 1, 5, 0, 1, 1, 6, 6, 6, 3, 3, 2, 0, 0, 5, 1, 6, 3, 0, 0))

To model it as an aggregated binomial regression we need to create a coverage variable (the maximum number of sessions).

df$coverage <- 6

Now we can create the aggregated binomial regression model

aggBinMod <- glm(
  formula = cbind(numCouns, coverage - numCouns) ~ durationRegUse + gms,
  data = df,
  family = binomial(link = "logit"))

And here is the output

summary(aggBinMod)

# output
# Coefficients:
#                 Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -1.157570   0.183116  -6.322 2.59e-10 ***
# durationRegUse  0.035975   0.008455   4.255 2.09e-05 ***
# gms             0.075838   0.039273   1.931   0.0535 .

Now for the part I am unsure of: how to generate predictions with which to assess model accuracy. As I understand it, if we use the predict() function with type = "response", we get a predicted per-trial probability of success on the Bernoulli scale (i.e. a value in [0, 1]).

predBin <- predict(aggBinMod, type = "response")
predBin
# (predicted Bernoulli probability for first 16 participants)
# 1         2         3         4         5         6         7         8 
# 0.4480346 0.3357882 0.3425441 0.5706073 0.3611657 0.3864206 0.3138308 0.4132440 
# 9        10        11        12        13        14        15        16 
# 0.3520203 0.3602692 0.3199350 0.3121589 0.2894678 0.4113600 0.3845787 0.3315728

Following that logic, to generate a predicted number of sessions for each client from our aggregated binomial regression model, we should be able to simply multiply the per-trial probability by the number of trials, in our case 6. So to generate the predictions we would run

predBin6 <- predict(aggBinMod, type = "response")*6
predBin6
# predicted number of sessions (out of a possible 6) for first 18 clients
# 1        2        3        4        5        6        7        8        9 
# 2.688208 2.014729 2.055265 3.423644 2.166994 2.318524 1.882985 2.479464 2.112122 
# 10       11       12       13       14       15       16       17       18 
# 2.161615 1.919610 1.872954 1.736807 2.468160 2.307472 1.989437 2.222478 2.037563 

And from there it is straightforward to assess model accuracy via the mean squared error

error <- predBin6 - df$numCouns
mse <- mean(error^2)
mse

# output
# [1] 4.871892

So my question is: is this the correct way to generate predictions from an aggregated binomial regression?


1 Answer

More or less, yes.

Instead of hard-coding the fact that there are 6 trials per observation (in some applications the number of trials differs from observation to observation), I would recommend

predBin6 <- predict(aggBinMod, type = "response")*weights(aggBinMod)

(which should give the same answer in your case).
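
As a quick sanity check, the two versions should agree, since weights() returns the per-observation trial counts, which are all 6 here:

all.equal(predict(aggBinMod, type = "response") * 6,
          predict(aggBinMod, type = "response") * weights(aggBinMod))
# should be TRUE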

I would also say that MSE is reasonable, but not necessarily the best measure of predictive accuracy for a binomial model (it doesn't take the dependence of the variance on the mean into account). (I don't have a particular alternative recommendation, but the deviance (deviance(aggBinMod)) or something similar might be appropriate.)
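
For instance, a minimal sketch of the deviance-based option (both lines compute the same quantity; the second just makes explicit that the residual deviance is a sum of squared deviance residuals):

deviance(aggBinMod)                             # residual deviance
sum(residuals(aggBinMod, type = "deviance")^2)  # equivalent, via deviance residuals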

  • Thanks @Ben Bolker. The reason I wanted to calculate accuracy was to assess the performance of this model over a Poisson regression. My understanding is that you can only compare models using likelihood-ratio tests if the models are nested, which, I gather, aggregated binomial and Poisson models are not. So in that circumstance MSE would be the best way to compare the models, right? More than happy to be corrected. – llewmills Mar 18 '21 at 21:05
  • Fair enough. It's an open question what would be best; it really depends what you want to predict/the goal of fitting the model in the first place (i.e., how does "predictive accuracy" map on to "making the right real-world decision given these data"?) In this case I probably wouldn't bother with the Poisson model at all since the binomial is obviously more appropriate. If you're comparing Poisson to binomial then the two models have the same number of parameters; you *can* compare the log-likelihoods directly, you just can't do a significance test (which you can't with a MSE either). – Ben Bolker Mar 18 '21 at 21:10
  • Alternatively, depending on what you have in mind for your analysis, you might want to use simulated draws, instead of expectation - `preds <- sapply(1:length(predBin), function(id){rbinom(n=1, size = weights(aggBinMod)[id], prob = predBin[id])})`. To compare models you could use the AIC - `aggBinMod$aic`. HTH – Plamen Petrov Mar 18 '21 at 21:10
  • AIC or AICc or a Vuong test would all be reasonable if you want a more formal approach. – Ben Bolker Mar 18 '21 at 21:10
  • @PlamenPetrov, good, but I wouldn't trust `$aic`. Use `AIC(aggBinMod)` instead. `$aic` is really weird. – Ben Bolker Mar 18 '21 at 21:10
  • @BenBolker fair enough – Plamen Petrov Mar 18 '21 at 21:12
  • see `?glm` for more details of when `$aic` is valid or not ... – Ben Bolker Mar 18 '21 at 21:19
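
For reference, a minimal sketch of the comparison discussed in the comments above (the Poisson fit poisMod is illustrative; since the models are not nested this is an informal comparison, not a significance test):

# Poisson model of the same outcome, for comparison
poisMod <- glm(numCouns ~ durationRegUse + gms, data = df, family = poisson)

# lower AIC / higher log-likelihood indicates the better-fitting model
AIC(aggBinMod, poisMod)
logLik(aggBinMod)
logLik(poisMod)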