I'm trying to predict the temperature at 12 UTC tomorrow at one location. To forecast, I use a basic linear regression model from the statsmodels module. Here is my code:

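import statsmodels.api as sm

x = ds_main                   # predictors (training set)
X = sm.add_constant(x)        # add the intercept column
y = ds_target_t               # target: temperature at 12 UTC
model = sm.OLS(y, X, missing='drop')  # rows containing NaN are dropped when fitting
results = model.fit()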

The summary shows that the fit is "good":

[image: OLS regression summary]

The problem appears when I try to predict values with a new dataset that I use as my test set. It has the same number of columns and the same variable names, but .predict() returns an array of NaN, even though my test set contains values:

xnew = ts_main
Xnew = sm.add_constant(xnew)
ynewpred = results.predict(Xnew)

I really don't understand where the problem is ...

UPDATE: I think I have an explanation: my Xnew dataframe contains NaN values. statsmodels can drop missing values (NaN) when fitting (missing='drop'), but .predict() does not drop them, so it returns an array of NaN values.

That explains the "why", but I still don't see how to fix it.

florian
  • And I don't have any Inf values in my datasets .... Only floats and NaN... – florian Mar 16 '16 at 17:44
  • Do you still get the correct prediction for all rows that don't have any nans in them? I think this is the correct behavior, if predict drops rows, then the user doesn't know which prediction is for which Xnew row. – Josef Mar 16 '16 at 23:44
  • If I delete all NaN values in my Xnew dataset (for the prediction), it predicts correctly. So the problem really does come from the .predict() function: unlike .fit(), it cannot handle NaN values. Do you think it may come from statsmodels? – florian Mar 19 '16 at 08:02
  • I don't think I understand the question. statsmodels predict methods are **supposed** to return nan prediction for all rows with at least one nan, and the correct values for rows that don't have any nans. – Josef Mar 19 '16 at 13:09

1 Answer

By default, statsmodels.api.OLS does not handle NA values in the data (its default is missing='none'). So if you use it, you need to drop the NA values first.

However, if you use statsmodels.formula.api.ols, it will automatically drop the NA values to run the regression and make predictions for you.

So you can try this:

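import pandas as pd
import statsmodels.formula.api as smf
# names in the formula ("y", "X") refer to columns of the data frame passed as `data`
lm = smf.ols(formula="y ~ X", data=pd.concat([y, X], axis=1)).fit()
lm.predict(Xnew)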
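A more concrete version of the formula approach might look like the sketch below. The column names (`temp`, `t850`, `wind`) and the data are assumptions invented for illustration. With the formula interface the intercept is added automatically and incomplete training rows are dropped by default. How `predict` treats rows with NaN in the new data may depend on the statsmodels version, so this sketch only predicts on complete rows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical training frame: target and predictors as named columns.
rng = np.random.default_rng(2)
train = pd.DataFrame(rng.normal(size=(60, 2)), columns=["t850", "wind"])
train["temp"] = (10.0 + 2.0 * train["t850"] - 1.0 * train["wind"]
                 + rng.normal(scale=0.1, size=60))
train.loc[[4, 7], "wind"] = np.nan  # two incomplete rows in the training set

# The formula interface drops rows with missing values when fitting.
lm = smf.ols("temp ~ t850 + wind", data=train).fit()
print(int(lm.nobs))  # number of rows actually used in the fit

# New data only needs the predictor columns; no add_constant call is needed.
test = pd.DataFrame({"t850": [0.5, -1.0], "wind": [1.0, 0.0]})
pred = lm.predict(test)
print(pred)
```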
Eiko
pinseng