For a fixed-effects model I was planning to switch from Stata's areg to Python's linearmodels.panel.PanelOLS.

But the results are different. In Stata I get R-squared = 0.6047 and in Python I get R-squared = 0.1454.

Why do I get such different R-squared values from the commands below?

Stata command and results:

use ./linearmodels_datasets_wage_panel.dta, clear
areg lwage expersq union married hours, vce(cluster nr) absorb(nr)

Linear regression, absorbing indicators             Number of obs     =  4,360
Absorbed variable: nr                               No. of categories =    545
                                                    F(4, 544)         =  84.67
                                                    Prob > F          = 0.0000
                                                    R-squared         = 0.6047
                                                    Adj R-squared     = 0.5478
                                                    Root MSE          = 0.3582

                                   (Std. err. adjusted for 545 clusters in nr)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     expersq |   .0039509   .0002554    15.47   0.000     .0034492    .0044526
       union |   .0784442   .0252621     3.11   0.002      .028821    .1280674
     married |   .1146543   .0234954     4.88   0.000     .0685014    .1608072
       hours |  -.0000846   .0000238    -3.56   0.000    -.0001313   -.0000379
       _cons |   1.565825   .0531868    29.44   0.000     1.461348    1.670302
------------------------------------------------------------------------------

Python command and results:

from linearmodels.datasets import wage_panel
from linearmodels.panel import PanelOLS

data = wage_panel.load()

mod_entity = PanelOLS.from_formula(
    "lwage ~ 1 + expersq + union + married + hours + EntityEffects",
    data=data.set_index(["nr", "year"]),
)

result_entity = mod_entity.fit(
    cov_type='clustered',
    cluster_entity=True,
)

print(result_entity)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1454
Estimator:                   PanelOLS   R-squared (Between):             -0.0844
No. Observations:                4360   R-squared (Within):               0.1454
Date:                Wed, Feb 02 2022   R-squared (Overall):              0.0219
Time:                        12:23:24   Log-likelihood                   -1416.4
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      162.14
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(4,3811)
Min Obs:                       8.0000                                           
Max Obs:                       8.0000   F-statistic (robust):             96.915
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(4,3811)
Avg Obs:                       545.00                                           
Min Obs:                       545.00                                           
Max Obs:                       545.00                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      1.5658     0.0497     31.497     0.0000      1.4684      1.6633
expersq        0.0040     0.0002     16.550     0.0000      0.0035      0.0044
hours       -8.46e-05   2.22e-05    -3.8101     0.0001     -0.0001  -4.107e-05
married        0.1147     0.0220     5.2207     0.0000      0.0716      0.1577
union          0.0784     0.0236     3.3221     0.0009      0.0321      0.1247
==============================================================================

F-test for Poolability: 9.4833
P-value: 0.0000
Distribution: F(544,3811)

Included effects: Entity
Wuff
  • The R-squared definitions differ. See the [documentation](https://bashtage.github.io/linearmodels/panel/mathematical-formula.html#r-2-calculation) for details on how the R2 measures are related and which one to use to match Stata. – Kevin S Feb 03 '22 at 00:18
  • @KevinS Thank you for your comment, but the documentation says for `R-squared (Between)`: "This measure matches Stata.". As you can see above *Stata*'s `R-squared = 0.6047` and *Python*'s `R-squared (Between) = -0.0844`. So in this case they don't seem to match. But I really don't know why. – Wuff Feb 03 '22 at 09:00
  • You need to use `xtreg` to get the match in Stata. `areg` and `xtreg` do not agree. If you use `xtreg` you will see R-sq: `within = 0.1454`, `between = 0.0004`, `overall = 0.0418` @wuff – Kevin S Feb 03 '22 at 13:17
  • @KevinS thanks for clarifying! I have now also checked the output of `reghdfe`, which gives the results most similar to `linearmodels` (i.e. `t-stat`, etc.), and I realize that I should give more thought to which R-squared to report in my case. – Wuff Feb 03 '22 at 14:11
  • Just stumbled upon `rsquared_inclusive` [here](https://bashtage.github.io/linearmodels/devel/panel/pandas.html?highlight=rsquared_inclusive), which is what `areg` returns. I only found it by accident, though. – Wuff Feb 03 '22 at 14:17
  • @KevinS if you use your two comments in an answer I'll accept it, so you can get the credit ;) – Wuff Feb 03 '22 at 22:08

1 Answer

areg runs an absorbing regression, that is, a linear regression that absorbs one categorical factor. The closest equivalent in linearmodels is linearmodels.iv.absorbing.AbsorbingLS(endog, exog, absorb=cats), where the categorical variable to absorb is passed as a DataFrame through the absorb keyword.

See the example below:

import pandas as pd
import statsmodels as sm
from linearmodels.iv import absorbing

dta = pd.read_csv('http://www.math.smith.edu/~bbaumer/mth247/labs/airline.csv')

dta.rename(columns={'I': 'airline', 
                    'T': 'year', 
                    'Q': 'output', 
                    'C': 'cost', 
                    'PF': 'fuel', 
                    'LF ': 'load'}, inplace=True)

Next, transform the absorbing variable into a categorical variable (in this case, I will use the airline variable):

cats = pd.DataFrame({'airline': pd.Categorical(dta['airline'])})

Then, just run the model:

exog_variables = ['output', 'fuel', 'load']
endog_variable = ['cost']

exog = sm.tools.tools.add_constant(dta[exog_variables])
endog = dta[endog_variable]

model = absorbing.AbsorbingLS(endog, exog, absorb=cats, drop_absorbed=True)
model_res = model.fit(cov_type='unadjusted', debiased=True)

print(model_res.summary)

Below are the results of the same model in Python and in Stata (using the command areg cost output fuel load, absorb(airline)):

Python:

                         Absorbing LS Estimation Summary                          
==================================================================================
Dep. Variable:                   cost   R-squared:                          0.9974
Estimator:               Absorbing LS   Adj. R-squared:                     0.9972
No. Observations:                  90   F-statistic:                        3827.4
Date:                Thu, Oct 27 2022   P-value (F-stat):                   0.0000
Time:                        20:58:04   Distribution:                      F(3,81)
Cov. Estimator:            unadjusted   R-squared (No Effects):             0.9926
                                        Varaibles Absorbed:                 5.0000
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          9.7135     0.2229     43.585     0.0000      9.2701      10.157
output         0.9193     0.0290     31.691     0.0000      0.8616      0.9770
fuel           0.4175     0.0148     28.303     0.0000      0.3881      0.4468
load          -1.0704     0.1957    -5.4685     0.0000     -1.4599     -0.6809
==============================================================================

Stata:

Linear regression, absorbing indicators         Number of obs     =         90
                                                F(3, 81)          =    3604.80
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9974
                                                Adj R-squared     =     0.9972
                                                Root MSE          =     .06011

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .9192846   .0298901    30.76   0.000     .8598126    .9787565
        fuel |   .4174918   .0151991    27.47   0.000     .3872503    .4477333
        load |  -1.070396     .20169    -5.31   0.000    -1.471696   -.6690963
       _cons |   9.713528    .229641    42.30   0.000     9.256614    10.17044
-------------+----------------------------------------------------------------
     airline |        F(5, 81) =     57.732   0.000           (6 categories)
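
To see why an absorbing regression can report a much higher R-squared than the within R-squared for the same coefficients, here is a small standard-library sketch. It uses synthetic data with one regressor, and every number and variable name in it is made up for illustration. Both measures share the same residual sum of squares; they differ only in the total sum of squares used as the denominator.

```python
import random

random.seed(0)

n_entities, n_periods = 50, 8
rows = []  # (entity, x, y)
for i in range(n_entities):
    alpha = random.gauss(0.0, 2.0)  # entity fixed effect, large relative to the noise
    for _ in range(n_periods):
        x = random.gauss(0.0, 1.0)
        y = alpha + 0.5 * x + random.gauss(0.0, 1.0)
        rows.append((i, x, y))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Within transformation: subtract each entity's mean from x and y.
x_bar = {i: mean(x for e, x, _ in rows if e == i) for i in range(n_entities)}
y_bar = {i: mean(y for e, _, y in rows if e == i) for i in range(n_entities)}
x_dm = [x - x_bar[e] for e, x, _ in rows]
y_dm = [y - y_bar[e] for e, _, y in rows]

# Fixed-effects slope and residual sum of squares.
beta = sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)
ssr = sum((b - beta * a) ** 2 for a, b in zip(x_dm, y_dm))

# Same residuals, two different total sums of squares.
tss_within = sum(b * b for b in y_dm)  # variation left after removing entity means
y_mean = mean(y for _, _, y in rows)
tss_total = sum((y - y_mean) ** 2 for _, _, y in rows)  # all variation in y

r2_within = 1 - ssr / tss_within    # within R-squared
r2_inclusive = 1 - ssr / tss_total  # entity effects counted as explanatory

print(f"beta         = {beta:.4f}")
print(f"within R2    = {r2_within:.4f}")
print(f"inclusive R2 = {r2_inclusive:.4f}")
```

Because the total variation in y is never smaller than the within-entity variation, the inclusive figure is always at least as large as the within figure, and the gap grows with the share of variance explained by the entity effects.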