
I have a dataset and I want to study the relationship between age and the probability of using a credit card, so I decided to apply weighted logistic regression.

Python code:

#Imports
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

#Let's create a dataframe of our data
use = np.array([5,7,11,20,47,49,54,53,76])
total = np.array([30,33,35,40,67,61,63,59,80])
age = np.arange(16, 25)
p = use / total
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "P": p})

#Let's create our model
logit_model = smf.glm(formula = "P ~ Age", data = df,
                      family = sm.families.Binomial(link=sm.families.links.Logit()),
                      var_weights = total).fit()

logit_model.summary()

Output

             Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      P   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -147.07
Date:                Mon, 08 May 2023   Deviance:                       1.9469
Time:                        10:22:00   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

In this model, I want to calculate the AIC.

logit_model.aic

Output

298.1385764743574

(In the model above I used the `var_weights` argument, as Josef suggested in this thread.)

Let's do the same in R.

R code:

use = c(5,7,11,20,47,49,54,53,76)
total = c(30,33,35,40,67,61,63,59,80)
age = seq(16, 24)
p = use/total

logit_temp = glm(p ~ age, family = binomial, weights = total)
logit_temp

Output

Call:  glm(formula = p ~ age, family = binomial, weights = total)

Coefficients:
(Intercept)          age  
   -11.3464       0.5996  

Degrees of Freedom: 8 Total (i.e. Null);  7 Residual
Null Deviance:      156.4 
Residual Deviance: 1.947    AIC: 40.12

As you can see, the AIC of the model I created in R is very different from the AIC I got from Python. What should I change in order to get the same results?

Billy
    "We all know that the log-likelihood of the saturated model is 0." I don't think that statement should be accepted as meaningful in this case. There is no saturated model. since you still have residual degrees of freedom. – IRTFM May 07 '23 at 01:30
  • The model in R is weighted; the model in Python is not. – Ritchie Sacramento May 07 '23 at 01:56
  • @RitchieSacramento I made some changes in order to apply weighted logistic regression, but as you can see there is still a problem. – Billy May 08 '23 at 07:54
  • @Josef But I haven't used any `weights` argument in my Python code. I have used the `weights` argument in my **R** code – Billy May 08 '23 at 15:55
  • Billy, you used `var_weights` ... ?? – Ben Bolker May 08 '23 at 16:10
  • @BenBolker Yes, I did in my Python code. I think that this argument fits my situation better. Also, `freq_weights` gave me a model with the wrong degrees of freedom for the residuals. – Billy May 08 '23 at 19:44
  • I was just responding to your comment "I haven't used any `weights` argument in my Python code." (I was thinking about "weights" in the broad sense, not in the sense of a specific argument name ...) I may just have been confused; it looks like you may have been responding to a now-deleted comment from @Josef? – Ben Bolker May 08 '23 at 20:03
  • @BenBolker Yes, I responded in a now deleted comment – Billy May 09 '23 at 07:06
  • Sorry about the deleted comment. I realized too late that I was looking at the R code. (With formulas the code of both packages looks close enough to get confused when not paying enough attention. :) weights - which weights? https://github.com/statsmodels/statsmodels/issues/2848 – Josef May 09 '23 at 17:12

3 Answers


The difference reduces to a difference in the log-likelihoods: both packages appear to (correctly) compute AIC as (-2*log-likelihood + 2*k), where k is the number of estimated parameters (here, 2: the intercept and the Age coefficient).

The help for the statsmodels GLM function says:

WARNING: Loglikelihood and deviance are not valid in models where scale is equal to 1 (i.e., Binomial, NegativeBinomial, and Poisson). If variance weights are specified, then results such as loglike and deviance are based on a quasi-likelihood interpretation. The loglikelihood is not correctly specified in this case, and statistics based on it, such as AIC or likelihood ratio tests, are not appropriate.

I'm not quite sure what to do about that ...

(I spent a little while digging down to try to reproduce the log-likelihood of -147.07 from Python by computing a quasi-likelihood, but so far without success)

Ben Bolker
  • Thank you for providing this part of the documentation! I want to ask you a question: if I create a model and the value of scale **is not** 1, is it appropriate to rely on the values of the log-likelihood and deviance I see with the command `my_model.summary()`? Thanks in advance! – Billy May 09 '23 at 15:06
  • As far as I know, yes, but I'm not a Python/`statsmodels` expert – Ben Bolker May 09 '23 at 15:06

I'm not sure (don't remember) how `aic` is defined with `var_weights` in statsmodels GLM.

Update:
`var_weights` are treated as a scale or dispersion factor in statsmodels GLM. statsmodels does not assume that there is a likelihood interpretation of `var_weights`, so we only have QMLE and not MLE (see Ben Bolker's answer).
Parameter estimates and standard errors are the same in both cases and do not directly rely on the MLE/QMLE distinction. However, the log-likelihood and statistics based on it, like `aic` and `bic`, will not correspond to a correctly specified likelihood under QMLE.

However, the example is just a standard binomial model.

In this case we can specify the (success, failure) counts directly as the dependent variable. This then produces the same estimates and the same `aic` as the R version. The results are interpreted as a standard maximum likelihood estimate in this case.

fail = total - use
df["fail"] = fail
logit_model2 = smf.glm(formula = "use + fail ~ Age", data = df,
                       family = sm.families.Binomial(link=sm.families.links.Logit()),
                       ).fit()

print(logit_model2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:        ['use', 'fail']   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -18.059
Date:                Mon, 08 May 2023   Deviance:                       1.9469
Time:                        11:46:25   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

logit_model.aic, logit_model2.aic
(298.1385764743575, 40.11730698160575)

As background, this notebook https://www.statsmodels.org/dev/examples/notebooks/generated/glm_weights.html illustrates some of the properties of var_weights and freq_weights and their equivalent standard model when the MLE assumption holds.
This relies to a large extent on the averaging and aggregation properties of the linear exponential families in GLM.

Josef
  • Thanks a lot for your help! I would like to express my concerns. More specifically, in the model you created you used the formula `"use + fail ~ Age"`, which is equivalent to the formula `total ~ Age`, and I feel that a model like this isn't very interpretable, because with the formula you used it is indirectly said that we don't care about how many people use a credit card in every age group; we only want the total number of people we asked in order to draw conclusions. Also, I think that this formula doesn't help us interpret the value of the Age coefficient. – Billy May 09 '23 at 15:38
  • 1
    `+` in formula means concatenation. The dependent variable has two count columns (number of successes, number of failures). See docstring for GLM. This is the standard definition for random variable of a Binomial distribution which is a distribution for counts (and not directly for proportion) – Josef May 09 '23 at 16:31
  • The parameters of Binomial with a Logit link still have the interpretation of log-odds ratios. – Josef May 09 '23 at 16:34
  • Aside: I find using Binomial as a model for proportions and not for counts confusing. The random variable is counts. The statsmodels version is a hybrid (e.g. https://github.com/statsmodels/statsmodels/issues/5282 https://github.com/statsmodels/statsmodels/issues/2698 ) – Josef May 09 '23 at 16:40

I would like to provide an answer similar to Josef's for anyone who wants to use the `sm.GLM()` function instead of `smf.glm()`.

#Imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

#Let's create a dataframe of our data
use = np.array([5,7,11,20,47,49,54,53,76])
total = np.array([30,33,35,40,67,61,63,59,80])
age = np.arange(16, 25)
fail = total - use
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "Fail": fail})

endog_star = df[["Use","Fail"]] #our response variable in the form [success, failure]
exog_star = df["Age"] #our independent variable
exog_star = sm.add_constant(exog_star) #a constant is not included by default, so we add one

#Let's create our model
logit_model = sm.GLM(
    endog = endog_star,
    exog = exog_star,
    family = sm.families.Binomial(link=sm.families.links.Logit())).fit()

logit_model.summary()

logit_model.aic

Output:

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:        ['Use', 'Fail']   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -18.059
Date:                Fri, 12 May 2023   Deviance:                       1.9469
Time:                        12:53:23   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

40.11730698160573
Billy