How do I aggregate data for glm() function in R

Question

I am trying to estimate relativities for insurance pricing using a GLM. I'm using the "freMTPL" data in CASdatasets. ClaimNb is my response and Exposure is my exposure measure; I'm interested in the claim frequency ClaimNb/Exposure.

After binning wide-ranging variables such as driver age (18-99) into a small number of categories (e.g. 5), I grouped the data using

data_grouped_freq <- data_freq4 %>%
  group_by(Power, Brand, Gas, Region, CarAge_cat, DriverAge_cat, Density_cat) %>%
  summarise(ClaimNb  = sum(ClaimNb),
            Exposure = sum(Exposure))

after which I use the command

model_freq <- glm(ClaimNb ~ Power + Brand + Gas + Region + CarAge_cat + DriverAge_cat + Density_cat,
                  family = poisson, data = data_grouped_freq, weights = Exposure)
summary(model_freq)

to fit the GLM. The result is then

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-255.241    -2.634    -0.929    -0.202   199.629  

Coefficients:
                                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                              4.8629082  0.0011698 4156.99   <2e-16 ***
Powerd                                  -0.4660131  0.0014613 -318.90   <2e-16 ***
Powere                                  -0.7155983  0.0013723 -521.44   <2e-16 ***
Powerg                                  -0.4131892  0.0010905 -378.89   <2e-16 ***
...
RegionPoitou-Charentes                  -2.3903228  0.0052288 -457.14   <2e-16 ***
CarAge_cat1                             -1.2547176  0.0021645 -579.68   <2e-16 ***
DriverAge_cat1                          -0.7913098  0.0022811 -346.90   <2e-16 ***
DriverAge_cat2                          -1.2886084  0.0024688 -521.96   <2e-16 ***

I know that this is wrong because DriverAge_cat1 has a higher ratio of ClaimNb/Exposure and should thus result in a relativity>1, which exp(-18.9082) is not. (The ratio of ClaimNb/Exposure for cat1 is 0.134 compared to 0.071 in the reference group of DriverAge_cat1)

Can someone explain what I am doing wrong? Is the problem perhaps that many cells have 0 claims? Or am I treating the weights incorrectly? There are 14661 cells in total across the 7 variables.

Try fitting a univariable model with just your outcome and `DriverAge_cat1`. If the result is what you expect (i.e. a relativity greater than 1), then your model could be working correctly and the additional variables in your multivariate model explain the negative direction of the effect. That is, the `DriverAge_cat1` effect may be positive only when the other variables are not considered. - JustGettinStarted, May 17 '20 at 14:03
@JustGettingStarted, after trying that it still gave me a negative value for DriverAge_cat1, even with no other variables in the model, so I probably am doing something wrong. Thank you for the suggestion though. - William, May 17 '20 at 14:18

score 0 · Answer 1 · edited Aug 15 '20 at 07:38

To fit a Poisson rate model, pass the exposure to glm() as an offset (on the log scale) instead of as weights:

model_freq <- glm(ClaimNb ~ Power + Brand + Gas + Region + CarAge_cat + DriverAge_cat + Density_cat,
                  family = poisson, data = data_grouped_freq, offset = log(Exposure))

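The modified code above should solve your issue. With weights = Exposure, the model still treats the raw claim count as the response and only reweights each cell's contribution to the likelihood, so the coefficients largely reflect how much exposure each level has. With offset = log(Exposure), the linear predictor models log(ClaimNb/Exposure), so exp() of a coefficient is a frequency relativity.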
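To see the difference between the two arguments, here is a minimal sketch on simulated data. Everything in it is hypothetical and not taken from the question's dataset: the data frame names, the single `age` factor, and the 90/10 split between bands. Only the two claim rates (0.071 and 0.134) are borrowed from the question. It aggregates policy-level rows into cells by summing claims and exposure, then compares the offset fit with the weights fit.

```r
set.seed(1)

# Hypothetical policy-level data: two driver-age bands with known claim rates.
# cat1 is deliberately rare so that its total claim count is small.
n <- 20000
age <- factor(sample(c("cat0", "cat1"), n, replace = TRUE, prob = c(0.9, 0.1)))
exposure <- runif(n, 0.1, 1)                        # policy-years
true_rate <- ifelse(age == "cat1", 0.134, 0.071)    # claims per policy-year
claims <- rpois(n, true_rate * exposure)
policies <- data.frame(age, exposure, claims)

# Aggregate to one row per cell by summing claims and exposure
cells <- aggregate(cbind(claims, exposure) ~ age, data = policies, FUN = sum)

# Rate model: log(E[claims]) = log(exposure) + X %*% beta
fit_policy <- glm(claims ~ age + offset(log(exposure)),
                  family = poisson, data = policies)
fit_cells  <- glm(claims ~ age + offset(log(exposure)),
                  family = poisson, data = cells)

# weights = exposure keeps the raw count as the response
fit_weights <- glm(claims ~ age, family = poisson, data = cells,
                   weights = exposure)

# Relativity of cat1 versus the reference level cat0
rel_offset  <- exp(coef(fit_cells)[["agecat1"]])
rel_weights <- exp(coef(fit_weights)[["agecat1"]])
rel_raw     <- with(cells, (claims / exposure)[age == "cat1"] /
                           (claims / exposure)[age == "cat0"])

print(c(offset = rel_offset, weights = rel_weights, raw = rel_raw))
```

In this sketch the offset fit reproduces the raw ratio of claim frequencies, and the policy-level and aggregated fits give the same coefficients, so summing ClaimNb and Exposure per cell loses nothing for this model. The weights fit instead returns the ratio of total claim counts, which drops below 1 simply because cat1 has far less exposure.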

edited Aug 15 '20 at 07:38

Arghya Sadhu

answered Aug 15 '20 at 07:16

volintine
