
I'm just starting to wrap my head around fixed effects, so apologies if the questions are redundant. Based on the Panel101 slides by Oscar Torres-Reyna (https://www.princeton.edu/~otorres/Panel101R.pdf), I am comparing the output of two different models:

  1. lsdv <- lm(formula = dependent_variable ~ poly(log(independent_variable1), degree = 2, raw = TRUE) + poly(log(independent_variable2), degree = 2, raw = TRUE) + factor(country) - 1, data = mydata)

  2. plm <- plm(formula = dependent_variable ~ poly(log(independent_variable1), degree = 2, raw = TRUE) + poly(log(independent_variable2), degree = 2, raw = TRUE), data = mydata, model = "within", index = c("country"))

In line with the Panel101 slides, both models produce exactly the same coefficients, but the adjusted R2 differs vastly (0.954 vs. 0.119).

Am I doing something wrong or how can this be explained?

Thanks!

M.Power
  • For fixed effects, you are running the regression on transformed data, so the dependent variable is $y_{it} - \overline{y}_i$. With LSDV, you use the original observed data (and add individual dummies). Because of the transformation in FE, the variability of the dependent variable changes, and so do statistics such as $R^2$. – Tomas Nov 09 '18 at 08:47

1 Answer


(I was planning to comment, but this came out too long....)

The summary of the lm model reports the R2 for a model of the form (using only one independent variable for simplicity)

lm(dependent_variable ~ independent_variable + factor(country))

The output of the plm model reports the R2 from the model

lm(dependent_var_demean ~ independent_var_demean)

Here, independent_var_demean and dependent_var_demean are obtained by subtracting the country-specific means of the independent and dependent variables from each observation.

As it turns out, the regression coefficient on independent_var is identical in the two cases. The R2 of the first model is much larger because it has N + 1 explanatory variables (one dummy per country plus independent_var, where N is the number of countries), while the second model has only one.
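
To make this concrete, here is a minimal sketch in R (the column names y, x and country in a data frame df are made up for illustration). It does the demeaning by hand and reproduces the pattern: the slope estimates coincide, but the R2's do not.

# LSDV: country dummies included explicitly
lsdv <- lm(y ~ x + factor(country), data = df)

# within transformation: subtract the country means from y and x
df$y_dm <- df$y - ave(df$y, df$country)
df$x_dm <- df$x - ave(df$x, df$country)
within <- lm(y_dm ~ x_dm, data = df)

coef(lsdv)["x"]                  # same slope estimate...
coef(within)["x_dm"]             # ...as here
summary(lsdv)$adj.r.squared      # large: the dummies absorb the between-country variation
summary(within)$adj.r.squared    # small: only the within-country variation is left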

Which of the R2's is 'correct' then? This depends on the context. If you treat the individual FEs as nuisance parameters and are only interested in the regression coefficient on independent_variable, it is more consistent to report the R2 from the within model (the 'plm output'). In some applications, the individual FEs may themselves be interesting, because they proxy unobserved qualities that affect both the dependent and the independent variable. In that case, the LSDV R2 (reported by lm) may be more relevant.

Nonetheless, it should be mentioned that in typical large-N/small-T settings (i.e. many units, each observed only a few times), the individual FE estimates can be biased. This is known as the incidental parameters problem.

Finally, I think I need to give a small shoutout to the lfe package for fixed effects regressions. It is very efficient with large panels, its syntax is IMO nicer than plm's, and clustered and robust standard errors are handled more elegantly than in plm. It also reports both R2's in the summary output.
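
For reference, a short sketch of the corresponding lfe call, again using the illustrative df/y/x/country names from above:

library(lfe)

# country fixed effects are absorbed after the | rather than estimated as dummies
est <- felm(y ~ x | country, data = df)
summary(est)   # the summary shows both the full-model and the projected (within) R2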

Otto Kässi