What's the cleanest, most pythonic way to run a regression only on non-missing data and use clustered standard errors?
Imagine I have a pandas DataFrame all_data.
Clunky method that works (make a dataframe without missing data):
I can make a new dataframe without the missing data, make the model, and fit the model:
import statsmodels.formula.api as smf
available_data = all_data.loc[:,['y', 'x', 'groupid']].dropna(how='any')
model = smf.ols('y ~ x', data = available_data)
result = model.fit(cov_type = 'cluster', cov_kwds={'groups': available_data['groupid']})
This feels a bit clunky, especially when I'm doing it all over the place with different right-hand-side variables. I also have to make sure that the variables in the formula match the columns I select from the dataframe.
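For what it's worth, the repetition in the workaround above can be reduced by wrapping it in a small helper. This is only a sketch of that same dropna approach, not a use of the missing argument: the function name fit_clustered_ols, its parameters, and the toy data are hypothetical and made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def fit_clustered_ols(formula, data, columns, group_col):
    """Hypothetical helper: drop incomplete rows, then fit OLS with clustered SEs.

    `columns` must list every variable used in `formula`; this sketch does
    not parse the formula, so keeping the two in sync is still up to the caller.
    """
    # Keep only the needed columns (de-duplicated, order preserved).
    cols = list(dict.fromkeys([*columns, group_col]))
    available = data.loc[:, cols].dropna(how="any")
    model = smf.ols(formula, data=available)
    # The groups come from the same filtered frame, so the lengths match.
    return model.fit(cov_type="cluster",
                     cov_kwds={"groups": available[group_col]})


# Toy data (assumption, for illustration only): 40 rows, 8 clusters,
# and two missing values in x.
rng = np.random.default_rng(0)
n = 40
all_data = pd.DataFrame({
    "x": rng.normal(size=n),
    "groupid": np.repeat(np.arange(8), 5),
})
all_data["y"] = 1.0 + 2.0 * all_data["x"] + rng.normal(size=n)
all_data.loc[[3, 17], "x"] = np.nan

result = fit_clustered_ols("y ~ x", all_data, ["y", "x"], "groupid")
print(result.bse)
```

This still requires listing the formula's variables by hand, which is the part I'd like to avoid.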
But is there a way to make it work using the missing argument?
I can make the model by setting the missing argument and fit the model.
m = smf.ols('y ~ x', data = all_data, missing = 'drop')
result_nocluster = m.fit()
That works great for the default, homoskedastic standard errors, but I don't know how to make this work with clustered standard errors. If I run:
result = m.fit(cov_type = 'cluster', cov_kwds = {'groups': all_data['groupid']})
I get the error ValueError: The weights and list don't have the same length.
Presumably the rows with missing observations aren't getting removed from all_data['groupid'], so it's throwing an error.
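A minimal sketch of the suspected mismatch, using made-up toy data (the values below are assumptions, not my real dataset): the model built with missing = 'drop' keeps fewer rows than the groupid column taken from the full dataframe.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (assumption): two of the six rows have a missing y or x.
all_data = pd.DataFrame({
    "y": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x": [0.5, np.nan, 1.5, 2.0, 2.5, 3.5],
    "groupid": [1, 1, 2, 2, 3, 3],
})

m = smf.ols("y ~ x", data=all_data, missing="drop")

# Rows the model actually kept after dropping missing values.
n_model_rows = m.endog.shape[0]
# Rows in the group column, which was never filtered.
n_group_rows = len(all_data["groupid"])

print("rows used by the model:", n_model_rows)
print("rows in all_data['groupid']:", n_group_rows)
```

If the two counts differ, the groups passed through cov_kwds cannot line up with the observations the model uses, which would be consistent with the length error above.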