Missing observations and clustered standard errors in Python statsmodels?

Question

What's the cleanest, most pythonic way to run a regression only on non-missing data and use clustered standard errors?

Imagine I have a Pandas dataframe all_data.

Clunky method that works (make a dataframe without missing data):

I can make a new dataframe without the missing data, make the model, and fit the model:

import statsmodels.formula.api as smf

available_data = all_data.loc[:,['y', 'x', 'groupid']].dropna(how='any')
model  = smf.ols('y ~ x', data = available_data)
result = model.fit(cov_type = 'cluster', cov_kwds={'groups': available_data['groupid']})

This feels a bit clunky (esp. when I'm doing it all over the place with different right hand side variables.) And I have to make sure that my stats formula matches the dataframe variables.

But is there a way to make it work using the missing argument?

I can make the model by setting the missing argument and fit the model.

m = smf.ols('y ~ x', data = all_data, missing = 'drop')
result_nocluster = m.fit()`

That works great for the default, homoeskedastic standard errors, but I don't know how to make this work with clustered standard errors? If I run:

result = m.fit(cov_type = 'cluster', cov_kwds = {'groups': all_data['groupid']})

I get the error ValueError: The weights and list don't have the same length. Presumably the rows with missing observations aren't getting removed from all_data['groupid'], so it's throwing an error.

There is currently no adjustment in the sandwich covariance computation to account for dropped observations. So, the first way of explicitly dropping observation, using dropna, is the only approach available. (I don't think it would be difficult to support this as long as the groups series has the same index as the original data.) — Josef, Apr 23 '17 at 20:56
Aside: If you want to compare models with different explanatory variables, then it would be better to remove the same observations in all models so that comparative statistics are not affected by changes in the dataset. — Josef, Apr 23 '17 at 21:37

score 1 · Accepted Answer · answered Jun 12 '20 at 10:49

(A bit too late but for the use of other users) In short, if you only want to use the missing argument in the smf.ols function, there is no way to make it work and, I think, there should not be one, given the current state of the package. The reason is exactly as you mentioned: "the rows with missing observations aren't getting removed" and they shouldn't. Because the missing argument creates a (lazy) copy of the input data without missing values and uses that as the input (input data: $X$, the lazy copy: $\hat{X}$). This process really should not remove the missing values from the original data ($X$)! At the same time, the groups array should refer to the same data that model uses, i.e., $\hat{X}$, however in you code the group variable is coming from the original data ($X$), which is different than the model data ($\hat{X}$).

One might argue that groups should accept just keywords. I guess that's something to be discussed more in depth over GitHub page of the package.

For now, one quick fix for your problem is to add a dropna to the second line, which defies the purpose. So it would look like this:

result = m.fit(cov_type = 'cluster',
               cov_kwds = {'groups': alldata[['y', 'x', 'groupid']].dropna()['groupid']})

Very ugly, inefficient, and mistake-prone! So, possibly your original chunky method would work better.

Missing observations and clustered standard errors in Python statsmodels?

Clunky method that works (make a dataframe without missing data):

But is there a way to make it work using the missing argument?

1 Answers1

Linked