The task
I have a dataset myData with a positive response y and a single predictor x.
I want to fit a generalized linear model (GLM) with a gamma family to these data using statsmodels. Using this model, I want to calculate, for each of my observations, the probability of observing a value that is smaller than (or equal to) that value. In other words, I want to calculate:
P(y <= y_i | x_i)
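For a single observation this is just the CDF of that observation's gamma distribution evaluated at y_i. A minimal sketch of that calculation in scipy, using made-up shape and scale values (the real values would come from the fitted model):

import scipy.stats as stat

# Hypothetical parameter values, purely to illustrate the CDF call
shape, scale = 2.0, 1.5
y_i = 3.0

# P(y <= y_i) under a Gamma(shape, scale) distribution
p = stat.gamma(shape, scale=scale).cdf(y_i)
print(p)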
My questions
1. How do I get the shape and scale parameters from the fitted GLM in statsmodels? According to this question, the scale parameter in statsmodels is not parameterized in the usual way. Can I use it directly as input to a gamma distribution in scipy, or do I need a transformation first? (See the snippet after this list for where I'm reading the scale from.)
2. How do I use these parameters (shape and scale) to get the probabilities? Currently I'm using scipy to generate a distribution for each x_i and get the probability from that. See the implementation below.
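Where I'm reading the scale from (mod is the fitted results object from the implementation below; as far as I can tell, for a Gamma family statsmodels estimates it by default as Pearson's chi-squared divided by the residual degrees of freedom):

# Estimated dispersion of the fitted GLM
phi = mod.scale
print(phi, mod.pearson_chi2 / mod.df_resid)  # should match by default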
My current implementation
import numpy as np
import scipy.stats as stat
import patsy
import statsmodels.api as sm
# Generate data in correct form
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
# Fit model with gamma family and log link
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
# Predict mean
myData['mu'] = mod.predict(exog=X)
# Predict probabilities (note that for a gamma distribution mean = shape * scale;
# here I take shape = mu_i / mod.scale and scale = mod.scale for each observation)
probabilities = np.array(
    [stat.gamma(m_i / mod.scale, scale=mod.scale).cdf(y_i)
     for m_i, y_i in zip(myData['mu'], myData['y'])]
)
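For comparison, here is the alternative parameterization I suspect might be the correct one, assuming mod.scale is the dispersion phi of the gamma GLM: the shape would then be fixed at 1/phi across observations and scipy's scale would vary as mu_i * phi. This still gives mean = shape * scale = mu_i, but variance phi * mu_i^2 instead of phi * mu_i. I haven't verified that this is right:

# Assumed alternative: shape = 1/phi (constant), scale = mu_i * phi
probabilities_alt = np.array(
    [stat.gamma(1.0 / mod.scale, scale=m_i * mod.scale).cdf(y_i)
     for m_i, y_i in zip(myData['mu'], myData['y'])]
)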
However, when I perform this procedure, the predicted cumulative probabilities all seem really high. In my plot of the data, the red line is the predicted mean, but even for points below this line the predicted cumulative probability is around 80%. This makes me wonder whether the scale parameter I used is indeed the correct one.
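One sanity check I'm considering (my own assumption about how to diagnose this): if each observation's fitted distribution were correct, the values P(y <= y_i | x_i) should be roughly uniform on [0, 1], so a histogram of the computed probabilities that piles up near 1 would indicate the parameterization is off:

import matplotlib.pyplot as plt

# Under a well-calibrated model these should look approximately Uniform(0, 1)
plt.hist(probabilities, bins=20, range=(0, 1))
plt.xlabel('P(y <= y_i | x_i)')
plt.ylabel('count')
plt.show()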