1

I have found very limited information on how ggeffects handles offsets. I found this article describing different ways packages in R and Stata handle offsets. I implemented the example in the website and discovered that ggpredict is estimating the factor variables at the mean value of the offset, but I thought it would not based on it using predict() rather than emmeans(), which behave differently. Any idea how to get ggpredict to use the actual offset values rather than the mean of the offset? Or another way to calculate them with a glmmTMB object? The results are very, very different at the mean value than averaged over the actual offset values. I'm unable to implement the example given of estimating the marginal effects by hand due to random effects and several more continuous variables in addition to factor variables...

ggpredict(nb_glm_offset)    
$age
# Predicted counts of claims

age   | Predicted |         95% CI
----------------------------------
<25   |     27.44 | [22.97, 32.78]
25-29 |     24.17 | [20.88, 27.97]
30-35 |     21.30 | [18.51, 24.50]
>35   |     18.04 | [16.21, 20.07]

Adjusted for:
* ln_holders = 4.90

$ln_holders
# Predicted counts of claims

ln_holders | Predicted |             95% CI
-------------------------------------------
         1 |      0.55 | [   0.46,    0.66]
         2 |      1.50 | [   1.26,    1.80]
         3 |      4.09 | [   3.42,    4.88]
         4 |     11.11 | [   9.30,   13.27]
         5 |     30.19 | [  25.28,   36.07]
         6 |     82.08 | [  68.71,   98.05]
         7 |    223.11 | [ 186.77,  266.52]
         9 |   1648.59 | [1380.06, 1969.36]

Adjusted for:
* age = <25

attr(,"class")
[1] "ggalleffects" "list"        
attr(,"model.name")
[1] "nb_glm_offset"
ClaireR
  • 21
  • 7

1 Answers1

0

The article you found pre-transformed (before fitting the model) the offset variable by log. ggpredict() does not take pre-transformed offset as flexible as predict(). If you prefer to use ggpredict(), you can fit a model with an in-model transformed offset, and correctly specify the condition argument in ggpredict().

Here is a reproducible example:

library(tidyverse)
library(ggeffects)

set.seed(123)

# the data used in the article
insurance = MASS::Insurance %>%
  rename_all(tolower) %>%
  mutate(
    # create standard rather than ordered factors for typical output
    age   = factor(age, ordered = FALSE),
    group = factor(group, ordered = FALSE),
    # create a numeric age covariate for later
    age_num = case_when(
      age == '<25'   ~ sample(18:25, n(), replace = T),
      age == '25-29' ~ sample(25:29, n(), replace = T),
      age == '30-35' ~ sample(30:35, n(), replace = T),
      age == '>35'   ~ sample(36:75, n(), replace = T),
    ),
    # for stata consistency
    ln_holders = log(holders)
  )

# fitting two models 
# the first one used pre-transformed offset (the author's model)
# the second one used in-model transformed offset
nb_glm_offset_1 = MASS::glm.nb(claims ~  age + offset(ln_holders), data = insurance)
nb_glm_offset_2 = MASS::glm.nb(claims ~  age + offset(log(holders)), data = insurance)

# predict() for pre-transformed offset
predictions_1 <- 
  expand.grid(age = levels(insurance$age), 
              ln_holders = insurance$ln_holders) %>% 
  mutate(prediction = predict(nb_glm_offset_1, newdata = ., type = 'response')) %>% 
  group_by(age) %>% 
  summarise(avg_prediction = mean(prediction)) 

# predict() for in-model transformed offset
predictions_2 <- 
  expand.grid(age = levels(insurance$age), 
              holders = insurance$holders) %>% 
  mutate(prediction = predict(nb_glm_offset_2, newdata = ., type = 'response')) %>% 
  group_by(age) %>% 
  summarise(avg_prediction = mean(prediction)) 

# ggpredict() for pre-transformed offset
ggpredictions_1 <- ggpredict(nb_glm_offset_1, condition = c(ln_holders = mean(insurance$ln_holders)))    
# ggpredict() for in-model transformed offset
ggpredictions_2 <- ggpredict(nb_glm_offset_2, condition = c(holders = mean(insurance$holders)))    

predictions_1:

  age   avg_prediction
1 <25             74.3
2 25-29           65.4
3 30-35           57.6
4 >35             48.8

predictions_2:

  age   avg_prediction
1 <25             74.3
2 25-29           65.4
3 30-35           57.6
4 >35             48.8

ggpredictions_1$age (which is your result):

age   | Predicted |         95% CI
----------------------------------
<25   |     27.44 | [22.97, 32.78]
25-29 |     24.17 | [20.88, 27.97]
30-35 |     21.30 | [18.51, 24.50]
>35   |     18.04 | [16.21, 20.07]

ggpredictions_2$age:

age   | Predicted |         95% CI
----------------------------------
<25   |     74.26 | [62.16, 88.71]
25-29 |     65.40 | [56.51, 75.70]
30-35 |     57.64 | [50.09, 66.32]
>35   |     48.82 | [43.87, 54.32]

Offset value varies dramatically case by case, so it is reasonable for a function to take the mean. In the example the mean was 364.9844. If you want a more meaningful offset value for visualising and interpreting the results, you could specify the holders value in the condition argument to 500. The interpretation then becomes for every 500 holders, we are expecting xxx number of counts.

This code should also have explained a few other misunderstanding in your post.

Juan_814
  • 51
  • 8