
I am used to using Stata or R for linear regression models, but I am transitioning more of my workflow over to Python.

The useful thing about those two programs is that they recognize you typically don't care about the individual entity- or time-fixed-effect coefficients in a linear model, so when estimating panel models they drop multicollinear dummies from the model (and report which ones they dropped).

While I understand that estimating models this way is not ideal, and that one should be careful about which regressions to run, this behavior is useful in practice: you can see results first and worry about the nuances of the dummies later (especially since you don't care about the dummy coefficients in a fully saturated fixed-effects model anyway).

Let me provide an example. The following code (which requires statsmodels and linearmodels) loads a dataset and attempts to run a panel regression. It is a modified version of the example from the linearmodels documentation.

# Load the data (requires statsmodels and linearmodels)
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
import pandas as pd
data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(['nr', 'year'])
data['year'] = year
print(wage_panel.DESCR)
print(data.head())

# Run the panel regression
from linearmodels.panel import PanelOLS
exog_vars = ['exper','union','married']
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)

This gives the following error:

AbsorbingEffectError: The model cannot be estimated. The included effects have fully absorbed one or more of the variables. This occurs when one or more of the dependent variable is perfectly explained using the effects included in the model.

However, suppose you export the same data to a Stata dataset:

data.drop(columns='year').to_stata('data.dta')

and then run the equivalent regression in your Stata do-file (after loading the data):

xtset nr year
xtreg lwage exper union married i.year, fe

Stata produces the following output:

. xtreg lwage exper union married i.year, fe
note: 1987.year omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =      4360
Group variable: nr                              Number of groups   =       545

R-sq:  within  = 0.1689                         Obs per group: min =         8
       between = 0.0000                                        avg =       8.0
       overall = 0.0486                                        max =         8

                                                F(9,3806)          =     85.95
corr(u_i, Xb)  = -0.1747                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0638624   .0032594    19.59   0.000     .0574721    .0702527
       union |   .0833697   .0194393     4.29   0.000     .0452572    .1214821
     married |   .0583372   .0183688     3.18   0.002     .0223235    .0943509
             |
        year |
       1981  |   .0496865   .0200714     2.48   0.013     .0103348    .0890382
       1982  |   .0399445    .019123     2.09   0.037     .0024521    .0774369
       1983  |   .0193513    .018662     1.04   0.300    -.0172373    .0559398
       1984  |   .0229574   .0186503     1.23   0.218    -.0136081    .0595229
       1985  |   .0081499   .0191359     0.43   0.670    -.0293677    .0456674
       1986  |   .0036329   .0200851     0.18   0.856    -.0357456    .0430115
       1987  |          0  (omitted)
             |
       _cons |   1.169184   .0231221    50.57   0.000     1.123851    1.214517
-------------+----------------------------------------------------------------
     sigma_u |  .40761229
     sigma_e |  .35343397
         rho |  .57083029   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Notice that Stata arbitrarily dropped the 1987 dummy from the regression but still ran. Is there a way to get similar functionality in linearmodels or statsmodels?

Lucas H
  • I don't think so. Stata is really good at making smart decisions when it comes to dealing with collinearity and getting things to run. `statsmodels` actually has an [issue with perfect multicollinearity](http://www.statsmodels.org/stable/pitfalls.html) – ALollz Mar 09 '19 at 04:22

1 Answer


The only way I can think to do it is manually.

Sample Data

import pandas as pd
import numpy as np
import statsmodels.api as sm

from linearmodels.datasets import wage_panel
from linearmodels.panel import PanelOLS

data = wage_panel.load()

First, we'll follow in Stata's footsteps and generate dummies for each of the year fixed effects, leaving out the first value in lexicographic order (accomplished with the drop_first=True argument). It's important to then use np.unique to get the labels, since it sorts as well. There's no need to have statsmodels add a constant; just add it yourself:

data = wage_panel.load()
data = pd.concat([data, pd.get_dummies(data.year, drop_first=True)], axis=1)
exog_vars = ['exper', 'union', 'married'] + np.unique(data.year)[1:].tolist()
data = data.set_index(['nr', 'year'])

exog = data[exog_vars].assign(constant=1)

If we try to run this model, it fails because of perfect collinearity. Since this is a within regression, we can't just test for collinearity on exog; we first need to de-mean within each entity, because what matters is collinearity among the de-meaned regressors. I'll work on a copy so as not to disturb the exog we'll use in the final regression.

exog2 = exog.copy()
exog2 = exog2 - exog2.groupby(level=0).transform('mean') + exog2.mean()

We can now see that there is perfect collinearity: regressing each de-meaned exog variable on all of the others gives a perfect R-squared for several of them:

for col in exog2:
    print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)

#exper 1.0
#union 0.004326532427527674
#married 0.18901677578724896
#1981 1.0
#1982 1.0
#1983 1.0
#1984 1.0
#1985 1.0
#1986 1.0
#1987 1.0
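As a quick cross-check (not something the per-column regressions above require, just a generic sketch on a toy matrix): perfect collinearity means the design matrix is rank-deficient, which np.linalg.matrix_rank will confirm without telling you *which* column is the culprit:

```python
import numpy as np

# Toy design matrix where the third column equals the sum of the first
# two, mimicking the dummy-trap collinearity above.
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [2.0, 0.0, 2.0],
])

# Rank 2 with 3 columns signals perfect collinearity.
print(np.linalg.matrix_rank(X))
```

The per-column R-squared loop is more work, but it identifies the offending columns rather than just flagging that a problem exists.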

How Stata or other software packages decide which variable to drop is a mystery to me. In this case you'd probably favor dropping one of the year dummies over the other variables, so let's pick 1987 to end up with the same result as Stata.

exog2 = exog2.drop(columns=1987)
for col in exog2:
    print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)

#exper 0.48631
#union 0.0043265
#married 0.1890
#1981 0.34978
#1982 0.28369
#1983 0.2478680
#1984 0.2469180
#1985 0.2846552
#1986 0.35067075
#constant 0.94641

So we've dealt with the collinearity and can go back to the model. Since we've included the yearly fixed effects manually, we can turn off time_effects in the model.

mod = PanelOLS(data.lwage, exog.drop(columns=1987), entity_effects=True, time_effects=False)
print(mod.fit())

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1689
Estimator:                   PanelOLS   R-squared (Between):             -0.0882
No. Observations:                4360   R-squared (Within):               0.1689
Date:                Sat, Mar 09 2019   R-squared (Overall):              0.0308
Time:                        00:59:14   Log-likelihood                   -1355.7
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      85.946
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(9,3806)
Min Obs:                       8.0000                                           
Max Obs:                       8.0000   F-statistic (robust):             85.946
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(9,3806)
Avg Obs:                       545.00                                           
Min Obs:                       545.00                                           
Max Obs:                       545.00                                           

                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exper          0.0639     0.0033     19.593     0.0000      0.0575      0.0703
union          0.0834     0.0194     4.2887     0.0000      0.0453      0.1215
married        0.0583     0.0184     3.1759     0.0015      0.0223      0.0944
1981           0.0497     0.0201     2.4755     0.0133      0.0103      0.0890
1982           0.0399     0.0191     2.0888     0.0368      0.0025      0.0774
1983           0.0194     0.0187     1.0369     0.2998     -0.0172      0.0559
1984           0.0230     0.0187     1.2309     0.2184     -0.0136      0.0595
1985           0.0081     0.0191     0.4259     0.6702     -0.0294      0.0457
1986           0.0036     0.0201     0.1809     0.8565     -0.0357      0.0430
constant       1.1692     0.0231     50.566     0.0000      1.1239      1.2145
==============================================================================

This matches the Stata output. There wasn't any particular reason to drop 1987; we could have chosen a different dummy, but at least this way we can see the results match xtreg.
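If you'd rather not pick the dropped column by hand, here is a rough sketch of one way to automate the choice (this is my own greedy helper, `drop_collinear`, not Stata's actual algorithm): keep only the columns that increase the rank of the design matrix, so whichever collinear column comes last in the ordering gets dropped.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, tol=1e-8):
    """Greedily keep columns that increase the matrix rank.
    The choice of which collinear column survives is order-dependent,
    unlike whatever rule Stata applies internally."""
    keep = []
    rank = 0
    for col in df.columns:
        candidate = df[keep + [col]].to_numpy(dtype=float)
        new_rank = np.linalg.matrix_rank(candidate, tol=tol)
        if new_rank > rank:
            keep.append(col)
            rank = new_rank
    return df[keep]

# Toy example: column 'c' = 'a' + 'b' is perfectly collinear and is dropped.
df = pd.DataFrame({'a': [1., 0., 1., 2.],
                   'b': [0., 1., 1., 0.],
                   'c': [1., 1., 2., 2.]})
print(drop_collinear(df).columns.tolist())  # ['a', 'b']
```

Applied to the de-meaned exog2 above, ordering the columns so the year dummies come last would reproduce Stata's choice of omitting 1987.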

ALollz
  • Yes. I think this would be great if someone could shed light on how Stata decides which dummies to drop. This makes a lot of sense though, and thanks for weighing in :) – Lucas H Mar 10 '19 at 18:48
  • @LucasH Yeah, I tried to look but couldn’t find anything. My suspicion is that they never drop the constant, and favor dummies `i.` variables over non-dummies. Perhaps they use the Variance Inflation Factor to decide which to drop iteratively once they have found collinearity. In this case I believe 1987 has the highest VIF, excluding the constant, so probably the best one to drop, given we have collinearity. – ALollz Mar 10 '19 at 19:41