How to apply a mask to a DataFrame in Python?

Question

My dataset, ds_f, is an 840x57 matrix that contains NaN values. I want to forecast a variable with a linear regression model, but when I try to fit the model I get the message "SVD did not converge":

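import statsmodels.api as sm

X = ds_f[ds_f.columns[:-1]]  # every column except the last one as predictors
y = ds_f['target_o_tempm']   # target column
model = sm.OLS(y, X)  # statsmodels
f = model.fit()  # ERROR: "SVD did not converge"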

So I've been searching for a way to apply a mask to a DataFrame. I thought of creating a mask to "ignore" the NaN values and then converting it back into a DataFrame, but I get the same DataFrame as ds_f; nothing changes:

m = ma.masked_array(ds_f, np.isnan(ds_f))
m_ds_f = pd.DataFrame(m,columns=ds_f.columns)

EDIT: I've solved the problem by writing model=sm.OLS(X,y,missing='drop'), but a new problem appears when I display the results: I get only NaN.

Are you using `statsmodels`? If so, you could specify `sm.OLS(y, X, missing='drop')`, to drop the `NaN` values prior to estimation. Alternatively, you may want to consider interpolating the missing values, rather than dropping them. — Nelewout, Mar 13 '16 at 12:43
You've made my day. I should have explored statsmodels before asking this question. Thank you very much! — florian, Mar 13 '16 at 12:54
Let me post that as an answer, so you can close the question! I'm glad you managed to resolve this :). — Nelewout, Mar 13 '16 at 14:53
Actually, I'm skeptical about the results when using the drop method... And I can't really interpolate given the data content (see the edit above). — florian, Mar 13 '16 at 15:21
The problem with the inference shown in the summary table most likely arises because you fit the data exactly and the error variance is zero: you have the same number of variables as observations. (The NaNs most likely come from a division by zero somewhere.) It should still be possible to use `predict`. You can verify on some examples whether it estimates as expected, for example by perturbing one value slightly. — Josef, Mar 13 '16 at 21:54
But this should be a separate question because it is independent of the missing-value handling, as answered by N Wouda. — Josef, Mar 13 '16 at 21:56

score 2 · Accepted Answer · answered Mar 13 '16 at 14:53

Are you using `statsmodels`? If so, you could specify `sm.OLS(y, X, missing='drop')` to drop the `NaN` values prior to estimation.

Alternatively, you may want to consider interpolating the missing values, rather than dropping them.
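If interpolation does make sense for your data, a minimal sketch with pandas could look like the following. The column name `tempm` and its values are hypothetical, and linear interpolation is only one of several methods that `DataFrame.interpolate` supports; whether it is appropriate depends on what the column represents.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps; the values are made up for illustration.
df = pd.DataFrame({"tempm": [10.0, np.nan, 14.0, np.nan, np.nan, 20.0]})

# Fill each gap linearly between its nearest non-missing neighbours.
filled = df.interpolate(method="linear")

print(filled["tempm"].tolist())
print(filled.isna().sum().sum())  # count of NaN values left after filling
```

The interpolated frame can then be passed to `sm.OLS` without the `missing` argument, since no NaN values remain in this example.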

Nelewout
