forecast method from statsmodels gives me in-sample forecast instead of out-of-sample

Question

I am trying to get a one-step-ahead forecast from an ARIMA model (using the SARIMAX object) for daily stock market data.

This is my code for the model:

import pandas as pd
import statsmodels.api as sm

df_train.index = pd.DatetimeIndex(df_train.index).to_period('D')
training_mod = sm.tsa.SARIMAX(df_train, order=model.order)
training_res = training_mod.fit()

The index of df_train is:

PeriodIndex(['2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
             '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11',
             '2018-01-12', '2018-01-16',
             ...
             '2021-09-14', '2021-09-15', '2021-09-16', '2021-09-17',
             '2021-09-20', '2021-09-21', '2021-09-22', '2021-09-23',
             '2021-09-24', '2021-09-27'],
            dtype='period[D]', name='Date', length=941)

Since I am fitting the model to df_train, and the data is daily, the forecast method with default arguments should return the forecast for '2021-09-28'.

The problem is that when I try running this line:

training_res.forecast()

It returns a forecast for the day '2020-07-31':

2020-07-31    0.022581
Freq: D, dtype: float64

I have tried specifying the number of steps in the forecast method.

training_res.forecast(1)

Output:

2020-07-31    0.022581
Freq: D, dtype: float64

training_res.forecast(10)

Output:

2020-07-31    0.022581
2020-08-01   -0.258066
2020-08-02    0.031083
2020-08-03    0.231221
2020-08-04   -0.075070
2020-08-05   -0.197679
2020-08-06    0.108804
2020-08-07    0.160034
2020-08-08   -0.132281
2020-08-09   -0.120677
Freq: D, Name: predicted_mean, dtype: float64

Finally, I have also tried specifying the start date and end date instead of giving a horizon for the forecast, but that raises a different error:

start_date = pd.to_datetime('2021-09-28')
end_date = pd.to_datetime('2021-10-05')
training_res.forecast(start= start_date, end= end_date)

Output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[172], line 3
      1 start_date = pd.to_datetime('2021-09-28')
      2 end_date = pd.to_datetime('2021-10-05')
----> 3 training_res.forecast(start= start_date, end= end_date)

File d:\miniconda3\envs\stocks\lib\site-packages\statsmodels\base\wrapper.py:113, in make_wrapper.<locals>.wrapper(self, *args, **kwargs)
    111     obj = data.wrap_output(func(results, *args, **kwargs), how[0], how[1:])
    112 elif how:
--> 113     obj = data.wrap_output(func(results, *args, **kwargs), how)
    114 return obj

File d:\miniconda3\envs\stocks\lib\site-packages\statsmodels\tsa\statespace\mlemodel.py:3442, in MLEResults.forecast(self, steps, **kwargs)
   3440 else:
   3441     end = steps
-> 3442 return self.predict(start=self.nobs, end=end, **kwargs)

TypeError: statsmodels.tsa.statespace.mlemodel.MLEResults.predict() got multiple values for keyword argument 'start'

I don't see where I am giving multiple values for the start argument.

The same thing happens when I pass in a Period object:

start_date = pd.Period('2021-09-28', freq='D')
end_date = pd.Period('2021-10-05', freq='D')
training_res.forecast(start=start_date, end=end_date)

Output:

TypeError: statsmodels.tsa.statespace.mlemodel.MLEResults.predict() got multiple values for keyword argument 'start'

The same thing also happens when I pass in a string:

training_res.forecast(start= '2021-09-29', end= '2021-10-05')

Output:

TypeError: statsmodels.tsa.statespace.mlemodel.MLEResults.predict() got multiple values for keyword argument 'start'

cfulton · Answer 1 · 2023-05-13T03:03:00.453


Update: This looks like a bug in statsmodels that occurs when a date index has missing entries (e.g. you are using business days, so entries for Sat/Sun are missing every week): https://github.com/statsmodels/statsmodels/issues/6247
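The dates in the question are consistent with that explanation. A minimal sketch using only the standard library, assuming (this is an inference from the numbers, not something checked against the statsmodels source) that the forecast date is computed as the first observation plus the number of observations, as if the daily index had no gaps:

```python
from datetime import date, timedelta

# Values taken from the PeriodIndex printed in the question.
first_obs = date(2018, 1, 2)
last_obs = date(2021, 9, 27)
nobs = 941  # length of the index

# Date obtained by counting nobs calendar days from the first observation,
# i.e. treating the index as a gapless daily range (assumption).
gapless_next = first_obs + timedelta(days=nobs)

# Date the asker expected: the day after the last observation.
expected_next = last_obs + timedelta(days=1)

# Number of calendar days actually spanned by the sample.
calendar_span = (last_obs - first_obs).days + 1

print(gapless_next)   # 2020-07-31, the date returned by forecast()
print(expected_next)  # 2021-09-28
print(calendar_span > nobs)  # True: weekends and holidays are missing
```

The first printed date matches the '2020-07-31' label in the question, which suggests the forecast values themselves may be out-of-sample and only the date labels are wrong.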

Original answer:

Answering this will probably require a minimal working example. Can you post your dataset? If the data itself is confidential, you could replace the actual values with zeros, as long as you keep the index the same.
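On the TypeError in the question: the traceback shows that forecast calls self.predict(start=self.nobs, end=end, **kwargs), so any start passed to forecast arrives through **kwargs and collides with the start that forecast supplies itself. A minimal sketch of the same pattern with a hypothetical stand-in class (ToyResults is not part of statsmodels, and its body is simplified for illustration):

```python
class ToyResults:
    """Hypothetical stand-in that mimics the call pattern in the traceback."""

    nobs = 941

    def predict(self, start=None, end=None, **kwargs):
        return (start, end)

    def forecast(self, steps=1, **kwargs):
        # Simplified assumption: forecast steps periods past the sample.
        end = self.nobs + steps - 1
        # forecast always sets start itself, then forwards kwargs.
        return self.predict(start=self.nobs, end=end, **kwargs)


res = ToyResults()

# Works: no start in kwargs.
ok = res.forecast(10)

# Fails: start is passed twice to predict (once explicitly, once via kwargs).
try:
    res.forecast(start='2021-09-28')
    message = None
except TypeError as exc:
    message = str(exc)

print(ok)
print(message)
```

Calling predict with start and end directly avoids this particular collision; whether the returned dates are labeled correctly still depends on the index issue described in the update above.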

edited May 13 '23 at 03:03

answered May 07 '23 at 16:34


I am not sure how to upload a file to Stack Overflow, but you can easily get the dataset by running this code:

import datetime as dt
import pandas_datareader.data as pdr
import yfinance as yfin
yfin.pdr_override()
start = str(dt.date(2018, 1, 1))
end = str(dt.date.today())
df_LVMH = pdr.get_data_yahoo('LVMHF', start=start, end=end)

– zaidmehdi May 08 '23 at 17:48
