Multivariate time series forecasting with 3 months dataset

Question

I have 3 months of daily data (one row per day) and I want to perform a multivariate time series analysis on it.

The available columns are:

Date    Capacity_booked Total_Bookings  Total_Searches  %Variation

Each date has exactly one entry and the dataset covers 3 months. I want to fit a multivariate time series model so that I can forecast all of the variables, not just one.

So far, this is my attempt, which I put together by reading articles:

import pandas as pd

# parse the Date column (day/month/year strings) into datetimes
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

data = df.drop(['Date'], axis=1)

data.index = df.Date

from statsmodels.tsa.vector_ar.vecm import coint_johansen
johan_test_temp = data
coint_johansen(johan_test_temp,-1,1).eig



#creating the train and validation set
train = data[:int(0.8*(len(data)))]
valid = data[int(0.8*(len(data))):]

freq=train.index.inferred_freq

from statsmodels.tsa.vector_ar.var_model import VAR

model = VAR(endog=train,freq=train.index.inferred_freq)
model_fit = model.fit()


# make prediction on validation
# forecast() expects the last k_ar observations as a NumPy array
prediction = model_fit.forecast(train.values[-model_fit.k_ar:], steps=len(valid))

cols = data.columns

# wrap the forecast array in a DataFrame aligned with the validation dates
pred = pd.DataFrame(prediction, index=valid.index, columns=cols)

I now have a validation set and a prediction set. However, the predictions are far worse than expected.

The plots of the dataset are -

1. %Variation
2. Capacity_booked
3. Total_Bookings and Total_Searches

The outputs that I am receiving are -

Prediction dataframe -

Validation Dataframe -

As you can see, the predictions are way off from what is expected. Can anyone advise a way to improve the accuracy? Also, if I fit the model on the whole dataset and then print the forecasts, it does not take into account that a new month has started, so it does not predict accordingly. How can that be incorporated here? Any help is appreciated.

EDIT

Link to the dataset - Dataset

Thanks

@SwaratheshAddanki I added the link to the dataset in the question... you can take a look. — dper, Dec 02 '19 at 06:23
You could try a classical machine learning algorithm using "home made" features. For example, you could train a perceptron, an SVM or a Random Forest to predict a single day from the past 7 days (make one row with the 4*7 features). You could also take into account the same day of the previous week (Wednesday if you want to predict for Wednesday) and an average over every Wednesday of the last month. Also use cross-validation in order to get a more realistic performance measurement — politinsa, Dec 05 '19 at 02:20
I believe that you don't have enough data to fit a good model: the main feature seems to be the downward jumps at the end of each month. We can only see two of these jumps in the data set, and from just two observations it will not be possible to learn much about what a typical jump looks like. Similarly, the growth during the months looks regular enough that the model could try to describe the shape of these curves, but there is little information about how much the values will grow over a typical month. Given this, "next month equals previous month" might be a good enough model? — jochen, Feb 16 '20 at 11:04
@jochen Thank you for your reply. As I understand your comment, you mean that I need to provide more data points (probably a year) to get good results from the algorithm? — dper, Feb 16 '20 at 18:42
@dper Yes, that's what I think. Once you have more data, you can try to find a model (this will still take some work), but from just 3 months I don't think there is a chance to do anything clever. If you have more data, maybe a good start would be to try to just model the downward jumps at the end of every month? Once you have this, maybe the increasing bits are not so difficult to do? — jochen, Feb 16 '20 at 19:41
@jochen I am kinda new to modelling and I wanted to ask as to how to model the downward jump? Can you suggest some links? — dper, Feb 16 '20 at 19:42
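
The "next month equals previous month" baseline suggested in the comments could be sketched roughly as below. This is only an illustration: `previous_month_forecast` is a hypothetical helper, the data is synthetic (the real dataset is not reproduced here), and the sketch assumes a daily `DatetimeIndex` with no missing days and a horizon of at most one month.

```python
import pandas as pd


def previous_month_forecast(data: pd.DataFrame, horizon: int) -> pd.DataFrame:
    """Hypothetical helper: forecast each future day with the value observed
    on the same day-of-month one month earlier."""
    future_index = pd.date_range(
        data.index[-1] + pd.Timedelta(days=1), periods=horizon, freq="D"
    )
    # Subtracting one calendar month clamps to the month end where needed,
    # e.g. 31 Dec maps back to 30 Nov.
    source_dates = future_index - pd.DateOffset(months=1)
    positions = data.index.get_indexer(source_dates)
    if (positions < 0).any():
        raise ValueError("horizon reaches beyond the available history")
    return pd.DataFrame(
        data.to_numpy()[positions], index=future_index, columns=data.columns
    )


# Synthetic stand-in for the real dataset: the value grows with the day of month.
dates = pd.date_range("2019-09-01", "2019-11-30", freq="D")
data = pd.DataFrame({"Total_Bookings": dates.day.to_numpy() * 10}, index=dates)

forecast = previous_month_forecast(data, horizon=31)  # covers December 2019
print(forecast.head())
```

A baseline like this restarts at the beginning of each month by construction, so it gives a reference error level that any fitted model would need to beat.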

score 1 · Answer 1 · answered Apr 16 '20 at 18:27

One way to improve your accuracy is to look at the autocorrelation of each variable, as suggested in the VAR documentation page:

https://www.statsmodels.org/dev/vector_ar.html

The larger the autocorrelation is at a specific lag, the more useful that lag will be to the model.
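
One possible way to inspect the autocorrelation per variable and per lag is sketched below. It uses pandas' `Series.autocorr`; the helper name `autocorr_table` and the synthetic `demo` data are assumptions made for illustration, not part of the question's dataset.

```python
import numpy as np
import pandas as pd


def autocorr_table(data: pd.DataFrame, max_lag: int) -> pd.DataFrame:
    """Hypothetical helper: autocorrelation of every column at lags 1..max_lag."""
    lags = range(1, max_lag + 1)
    return pd.DataFrame(
        {col: [data[col].autocorr(lag=k) for k in lags] for col in data.columns},
        index=pd.Index(lags, name="lag"),
    )


# Synthetic data: one column with an exact weekly cycle, one pure-noise column.
rng = np.random.default_rng(0)
n_days = 91
demo = pd.DataFrame(
    {
        "weekly": np.tile(np.arange(7.0), n_days // 7),
        "noise": rng.normal(size=n_days),
    },
    index=pd.date_range("2019-09-01", periods=n_days, freq="D"),
)

table = autocorr_table(demo, max_lag=10)
print(table.round(2))
# Lag with the largest absolute autocorrelation for each column.
print(table.abs().idxmax())
```

On the real data the same call would be `autocorr_table(data, max_lag=...)`, with `max_lag` kept small given that only about 90 rows are available.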

Another good idea is to look at the AIC and BIC criteria to compare candidate models (the page linked above has a usage example). Smaller values indicate a better trade-off between goodness of fit and model complexity.

This way, you can vary the order of your autoregressive model and pick the one with the lowest AIC and BIC, considering both together. If AIC indicates that the best model has a lag of 3 and BIC indicates a lag of 5, you should try lags 3, 4 and 5 and see which gives the best results.

The best scenario would be to have more data (3 months is not much), but you can try these approaches to see if they help.
