3

Python 3.6

My dataset looks like this:

It's travel bookings, say for a travel company e.g. airlines/trains/buses etc.

date           bookings
2017-01-01     438
2017-01-02     167
...
2017-12-31     45
2018-01-01     748
...
2018-11-29     223

I need something like this (i.e. forecasted data beyond dataset):

date           bookings
2017-01-01     438
2017-01-02     167
...
2017-12-31     45
2018-01-01     748
...
2018-11-29     223
2018-11-30     98
...
2018-12-30     73
2018-12-31     100

Code:

import pyodbc
import pandas as pd
import cufflinks as cf
import plotly.plotly as ply
from pmdarima.arima import auto_arima

sql_conn = pyodbc.connect(# connection details here)
query = #sql query here
df = pd.read_sql(query, sql_conn, index_col='date')
df.index = pd.to_datetime(df.index)

stepwise_model = auto_arima(df, start_p=1, start_q=1,
                           max_p=3, max_q=3, m=7,
                           start_P=0, seasonal=True,
                           d=1, D=1, trace=True,
                           error_action='ignore',  
                           suppress_warnings=True, 
                           stepwise=True)
stepwise_model.aic()

train = df.loc['2017-01-01':'2018-06-30']
test = df.loc['2018-07-01':]

stepwise_model.fit(train)
future_forecast = stepwise_model.predict(n_periods=len(test))
future_forecast = pd.DataFrame(future_forecast,
                               index=test.index,
                               columns=['prediction'])
pd.concat([test, future_forecast], axis=1).iplot()

Result enter image description here

As you can see prediction is way off and I assume the problem is not using the right auto_arima parameters. What is the best way to get these parameters? I could perhaps trial and error but it would be good to get an understanding of the standard/non-standard procedure in obtaining the best fit.

Any help would be much appreciated.

Sources:

Mohamed Thasin ah
  • 10,754
  • 11
  • 52
  • 111
AK91
  • 671
  • 2
  • 13
  • 35

1 Answers1

2

You have a structural break around Aug 2018, but you are training only up until Jul 2018. ARIMA (or any univariate time series method for that matter) will never be able to predict that structural break. You will have to extend your training data set to include values from Aug and Sep 2018.

See the first section of this blog post to better understand why this is happening.

Alex Kinman
  • 2,437
  • 8
  • 32
  • 51