
I am fitting a fixed effects model with PanelOLS in Python and using plm in R to validate the result. To my surprise, the coefficients and p-values differ between the two, even though they are supposed to be the same. Why?

The data come from an R dataset (Fatalities in the AER package):

library(AER)
data(Fatalities)
# define the fatality rate
Fatalities$fatal_rate <- Fatalities$fatal / Fatalities$pop * 10000
# mandatory jail or community service
Fatalities$punish <- with(Fatalities, factor(jail == "yes" | service == "yes", 
                                             labels=c("no", "yes")))

Below is my Python code for PanelOLS

from linearmodels import PanelOLS  # Fatalities here is the R data set loaded into a pandas DataFrame

w1 = Fatalities.set_index(["state", "year"])
mod = PanelOLS(w1['fatal_rate'], w1[['beertax', 'drinkage', 'punish', 'miles',
                                     'unemp', 'income']], entity_effects=True)
result = mod.fit(cov_type='clustered', cluster_entity=True)
display(result.summary)

Below is my R code for plm

fatalities_mod6 <- plm(fatal_rate ~ beertax + year + drinkage + punish + miles +
                         unemp + log(income), index=c("state", "year"), 
                       model="within", effect="twoways", data=Fatalities)

I would also like to ask about the cov_type argument in PanelOLS. As I understand it, if I want robust standard errors and p-values I should use cov_type='robust' instead of 'clustered'. But I see many fixed effects examples using 'clustered'. Which one should I use to get the correct standard errors and p-values for the variables?
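For reference, the two calls I am comparing are below (a minimal sketch reusing the mod object from my code above; only the cov_type changes):

res_robust = mod.fit(cov_type='robust')  # heteroskedasticity-robust (White-type) standard errors
res_clustered = mod.fit(cov_type='clustered', cluster_entity=True)  # standard errors clustered by state (entity)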

Output of the Python PanelOLS:

[PanelOLS summary shown as screenshots]

Output of R plm:

[plm summary shown as a screenshot]

  • Can you please provide the output for both of these commands? – MrFlick Dec 24 '20 at 07:58
  • @MrFlick I have edited the original post and added the output for both commands. Currently I have clustered the residuals within entities (cov_type='clustered'). When I use cov_type="robust", the standard errors and p-values are different; I would like to know which one I should use. :) – Ltng Dec 24 '20 at 11:06

1 Answer


Two issues: 1. you're using the year variable in the plm formula, which is redundant because it's already part of the index, and 2. your Python PanelOLS code so far only estimates individual (entity) fixed effects. I can replicate the Python estimates with plm using effect="individual".

library(plm)
fatalities_mod6 <- plm(fatal_rate ~ beertax + drinkage + punish + miles + 
                         unemp + log(income), index=c("state", "year"), 
                       model="within", effect="individual", data=Fatalities)

Furthermore, Python's PanelOLS appears to use standard errors clustered on state, applying the Arellano method with heteroscedasticity-consistent standard errors of type 1 ("HC1").

round(summary(fatalities_mod6, 
        vcov=vcovHC.plm(fatalities_mod6, cluster="group", type="HC1",
                        method="arellano"))$coe, 4)
#             Estimate Std. Error t-value Pr(>|t|)
# beertax      -0.3664     0.2920 -1.2550   0.2105
# drinkage     -0.0378     0.0252 -1.4969   0.1355
# punishyes    -0.0342     0.0951 -0.3598   0.7193
# miles         0.0000     0.0000 -0.4217   0.6736
# unemp        -0.0196     0.0128 -1.5385   0.1250
# log(income)   0.6765     0.5424  1.2472   0.2134

This closely matches the PanelOLS result from Python.
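On the Python side this corresponds to the call already in the question, i.e. the entity-effects-only model fitted with entity-clustered standard errors (sketch, assuming w1 is the indexed DataFrame from the question):

result = mod.fit(cov_type='clustered', cluster_entity=True)  # clustered on state (Arellano-type)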

Edit

For the two-way fixed effects estimator of your data with cluster-robust standard errors, the code would be,

for Python:

mod = PanelOLS(w1['fatal_rate'], w1[['beertax', 'drinkage', 'punish', 'miles',
                                     'unemp', 'income']],
               entity_effects=True, time_effects=True)
result = mod.fit(cov_type='clustered', cluster_entity=True)  # cluster-robust (by entity) standard errors

and for R:

fatalities_mod6 <- plm(fatal_rate ~ beertax + drinkage + punish + miles + 
                             unemp + log(income), index=c("state", "year"), 
                           model="within", effect="twoways", data=Fatalities)
summary(fatalities_mod6, 
        vcov=vcovHC.plm(fatalities_mod6, cluster="group", type="HC1"))

Edit 2

Here follow comparisons of Python, R, and Stata, conducted with the grunfeld data that come with statsmodels in Python (they're slightly different from the data(Grunfeld, package="plm")).

PanelOLS (Python), plm (R), and xtreg with vce(cluster *clustvar*) (Stata) appear to apply slightly different methods to calculate cluster-robust standard errors (see linked docs for details).

Python:

from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data = data.set_index(['firm','year'])

import pandas as pd
data.to_csv("grunfeld.csv")  ## export data for R / Stata

from linearmodels import PanelOLS
mod = PanelOLS.from_formula('invest ~ value + capital + EntityEffects + TimeEffects', 
                            data=data)
print(mod.fit())
#             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
# ------------------------------------------------------------------------------
# value          0.1167     0.0129     9.0219     0.0000      0.0912      0.1422
# capital        0.3514     0.0210     16.696     0.0000      0.3099      0.3930

print(mod.fit(cov_type='robust'))
# value          0.1167     0.0191     6.1087     0.0000      0.0790      0.1544
# capital        0.3514     0.0529     6.6472     0.0000      0.2471      0.4557

print(mod.fit(cov_type='clustered', cluster_entity=True))
# value          0.1167     0.0113     10.368     0.0000      0.0945      0.1389
# capital        0.3514     0.0470     7.4836     0.0000      0.2588      0.4441

print(mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True))
# value          0.1167     0.0117     10.015     0.0000      0.0937      0.1397
# capital        0.3514     0.0447     7.8622     0.0000      0.2633      0.4396

R:

grunfeld <- read.csv("V:/Python/fatalities/grunfeld.csv")

library(plm)
fit <- plm(invest ~ value + capital, grunfeld, effect="twoways", model="within",
           index=c("firm", "year"))

## resembles exactly py mod.fit()
round(summary(fit)$coe, 4)  
#         Estimate Std. Error t-value Pr(>|t|)
# value     0.1167     0.0129  9.0219        0
# capital   0.3514     0.0210 16.6964        0

## resembles approximately py mod.fit('cluster', cluster_entity=True) and Stata (below)
round(summary(fit, cluster="group", 
              vcov=vcovHC.plm(fit, method="arellano", type="HC2")
)$coe, 4) 
#         Estimate Std. Error t-value Pr(>|t|)
# value     0.1167     0.0114 10.2042        0
# capital   0.3514     0.0507  6.9364        0

## doesn't seem to change with two-way clustering
round(summary(fit, cluster=c("group", "time"),
              vcov=vcovHC.plm(fit, method="arellano", type="HC2")
)$coe, 4) 
#         Estimate Std. Error t-value Pr(>|t|)
# value     0.1167     0.0114 10.2042        0
# capital   0.3514     0.0507  6.9364        0

fit.1 <- lm(invest ~ value + capital + factor(firm) + factor(year) + 0, grunfeld)
library(lmtest)
## resembles exactly py mod.fit(cov_type='robust')
round(coeftest(fit.1, vcov.=sandwich::vcovHC(fit.1, type="HC1"))[1:2,], 4)
#         Estimate Std. Error t value Pr(>|t|)
# value     0.1167     0.0191  6.1087        0
# capital   0.3514     0.0529  6.6472        0

Stata:

import delim grunfeld.csv, clear
egen firmn = group(firm)
xtset firmn year

xtreg invest value capital i.year, fe vce(cluster firmn) 
#         |               Robust
#  invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
# -------------+----------------------------------------------------------------
#   value |   .1166811   .0114755    10.17   0.000     .0911122    .1422501
# capital |   .3514357   .0478837     7.34   0.000     .2447441    .4581273
  • Thank you so much @jay.sf for your answer. I would also like to ask: for this data, I guess the R code is the best fit for this panel data (it considers the year effect, effect="twoways")? About the standard errors, which should I choose, clustered by entity or robust? Is the R code coeftest(fatal_tefe_mod, vcov = vcovHC, type = "HC1") equivalent to the Python code result=mod.fit(cov_type='Robust', cluster_entity=True)? – Ltng Dec 24 '20 at 14:14
  • @Ltng According to the [`PanelOLS` documentation](https://bashtage.github.io/linearmodels/doc/panel/models.html) you may use `entity_effects=True, time_effects=True`, which is `False` by default, to get the within estimator. And, of course, you should use cluster-robust standard errors; in R you could actually use my code but with `effect="twoways"`. – jay.sf Dec 24 '20 at 14:24
  • Hi @jay.sf. Thank you for your reply! Which code gives me cluster-robust standard errors? (Can I see it in both R and Python?) – Ltng Dec 24 '20 at 14:43
  • @Ltng See edits please, estimates and standard errors should match, right? – jay.sf Dec 24 '20 at 14:59
  • @Ltng You're welcome, I consider this a _yes_. You could [accept the answer](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work/5235#5235) now in this case. – jay.sf Dec 24 '20 at 15:10
  • Hi Jay, @jay.sf I have tested the code in Python and R, and they actually give different results for the standard errors and p-values (the coefficients are the same). Can I know the corresponding Python code for the cluster-robust standard errors you have in R? For cluster-robust standard errors in Python I tried result=mod.fit(cov_type='clustered', cluster_entity=True) and result=mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True); the R code is the same as your answer except that I changed effect="twoways", but neither of the two Python results matches the R result. – Ltng Dec 25 '20 at 08:47
  • @Ltng That's actually interesting. I compared the methods and can confirm your observation that the statistics are slightly different. What you want is 1. to include both unit and time fixed effects in your model, for which I used the formula method and simply added ` + EntityEffects + TimeEffects` (see **Edit 2**), and 2. to use `print(mod.fit(cov_type='clustered', cluster_entity=True))`. I couldn't find the `Fatalities` data for Python, so I gave an example using the `grunfeld` data. I also linked the docs, and I think you may now choose the method that best fits your needs. – jay.sf Dec 25 '20 at 13:21
  • Hi @jay.sf. Thank you so much for the extensive documentation! This is super helpful! :) I also have another question about best practice for choosing and comparing models. Could you possibly take a look at that too? :) https://stackoverflow.com/questions/65446646/what-are-the-standard-panel-data-model-selections-and-steps – Ltng Dec 25 '20 at 14:34
  • Hi @jay.sf I was hoping if I could borrow your expertise with a new question that I have posted recently? :) https://stackoverflow.com/questions/65528497/what-variables-to-include-in-fixed-effect-model-panel-data – Ltng Jan 01 '21 at 09:28