
I need to perform GLM (Poisson) estimations with fixed-effects (say only unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset in which most (~90%) of the observations have missing values (NA) for some, but not all, regressors.

fixest::feglm() can handle this and returns my fitted model. However, to do so, it (and fixest::demean too) removes observations that have at least one missing regressor before constructing the fixed-effect means.

In my case, I am afraid this discards a significant share of the information available in the data. Therefore, I would like to demean my variables by hand, so as to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this yields negative values of the dependent variable, which is not compatible with Poisson. If I run feglm with the "poisson" family on my manually demeaned data, I get, as expected: "Negative values of the dependent variable are not allowed for the "poisson" family." The same error is returned with data demeaned using the fixest::demean function.

Question:

How does feglm handle negative values of the demeaned dependent variable? Is there a way (e.g. some data transformation) to reproduce fepois with fixed-effects in the formula by running fepois on demeaned data with a formula that has no fixed-effects?

To use the example from the fixest::demean documentation (with two-way fixed-effects):

library(fixest)

data(trade)

base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)

# We center the two variables ln_dist and ln_euros
#  on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean

and I would like to reproduce

est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)

with

est = fepois(ln_euros_dm ~ ln_dist_dm, base)
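For reference, a quick check along these lines (assuming the trade data shipped with fixest, as above) shows why the second call cannot run as-is: demeaning pushes part of ln_euros below zero, which the Poisson family rejects with the error quoted earlier.

est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)

# The demeaned outcome contains negative values ...
summary(base$ln_euros_dm)

# ... so this stops with: "Negative values of the dependent variable
#  are not allowed for the 'poisson' family."
try(est <- fepois(ln_euros_dm ~ ln_dist_dm, base))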

1 Answer


I think there are two main problems.

Modelling strategy

In general, it is important to be able to formally describe the estimated model. In this case it would not be possible to write down the model as a single equation in which the fixed-effects are estimated using all the data while the other variables use only the non-missing observations. And if the model is not clear, then... maybe it's not a good model.

On the other hand, if your model is well defined, then removing observations at random should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.

By suggesting that the observations with missing values are relevant for estimating the fixed-effects coefficients (or, stated differently, that they should be used to demean some variables), you are implying that the missing values are not randomly distributed. And now you should worry.

Just using these observations to demean the variables would not remove the bias in the estimated coefficients caused by selection into non-missingness. That is a deeper problem, which cannot be fixed by technical tricks but only by a thorough understanding of the data.

GLM

There is a misunderstanding about GLM. GLM is a very clever trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed at a time when general-purpose optimization was computationally very expensive, and it made it possible to employ well-developed, fast OLS techniques to perform equivalent estimations.

GLM is an iterative process in which a regular OLS estimation is performed at each step; the only thing that changes across iterations is the weight attached to each observation. Since each step is a regular (weighted) OLS problem, techniques for fast OLS estimation with multiple fixed-effects can be leveraged (as is done in the fixest package).

So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means should you demean the data before running the GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).
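To make this concrete, here is a minimal sketch of the IRLS loop behind a Poisson GLM (an illustration only: the function name poisson_irls is made up and this is not fixest's implementation). The working dependent variable z and the weights w are recomputed at every iteration, so any demeaning would have to be a weighted demeaning performed inside the loop, not a one-off transformation of the raw data.

# Minimal IRLS loop for a Poisson GLM with log link (illustration only)
# y: response vector, X: model matrix (including the intercept)
poisson_irls = function(y, X, tol = 1e-8, maxit = 25) {
  beta = rep(0, ncol(X))
  for (i in 1:maxit) {
    eta = drop(X %*% beta)          # linear predictor
    mu  = exp(eta)                  # inverse link
    w   = mu                        # IRLS weights (for Poisson, Var(y) = mu)
    z   = eta + (y - mu) / mu       # working dependent variable
    # Weighted OLS step: this is where fixed-effects would be absorbed,
    # via a weighted demeaning that depends on the current w
    beta_new = drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta = beta_new
  }
  beta
}

An observation with a missing regressor has no linear predictor, hence no weight, which is why it cannot enter the weighted demeaning.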

Laurent Bergé
  • Thanks for clearly distinguishing the modelling and the algorithmic problems, and for pointing me to useful material on GLM. Indeed, my motivation for not removing randomly missing observations within each fixed-effect dimension is to reduce the coefficients' variances. Is it possible with {fixest} to access the OLS step and implement a different demeaning (one that does not remove partially missing observations before demeaning)? If not, is it primarily a practical or a conceptual issue? – Valentin Guye Sep 18 '20 at 13:31
  • In programming, everything is possible. But in this case, there's a big conceptual problem. In GLM you need the weights, and the OLS step depends on the weights (meaning you should apply a weighted demeaning). And you can't have weights for the observations with missing values. So it seems to me that what you want to do is simply impossible. – Laurent Bergé Sep 19 '20 at 07:08