
I have 3 months of daily data (one row per day) and I want to perform a multivariate time series analysis on it.

The available columns are:

Date    Capacity_booked Total_Bookings  Total_Searches  %Variation

Each date has exactly one entry in the dataset, which covers 3 months, and I want to fit a multivariate time series model so that I can forecast the other variables as well.

So far, this is my attempt, which I put together by reading articles:

import pandas as pd

# parse the Date column (day/month/year) into datetimes
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

data = df.drop(['Date'], axis=1)

data.index = df.Date

from statsmodels.tsa.vector_ar.vecm import coint_johansen
johan_test_temp = data
coint_johansen(johan_test_temp,-1,1).eig



#creating the train and validation set
train = data[:int(0.8*(len(data)))]
valid = data[int(0.8*(len(data))):]

freq=train.index.inferred_freq

from statsmodels.tsa.vector_ar.var_model import VAR

model = VAR(endog=train,freq=train.index.inferred_freq)
model_fit = model.fit()


# make prediction on validation; forecast() expects the last k_ar observations as an array
prediction = model_fit.forecast(train.values[-model_fit.k_ar:], steps=len(valid))

cols = data.columns

# wrap the forecast array in a DataFrame with the original column names
pred = pd.DataFrame(prediction, columns=cols)

I have a validation set and a prediction set. However, the predictions are much worse than expected.

The plots of the dataset are:

  1. %Variation [image]

  2. Capacity_booked [image]

  3. Total_Bookings and Total_Searches [image]

The output I am receiving is:

Prediction dataframe: [image]

Validation dataframe: [image]

As you can see, the predictions are far from what is expected. Can anyone advise a way to improve the accuracy? Also, if I fit the model on the whole dataset and then print the forecasts, it does not take into account that a new month has started and predict accordingly. How can that be incorporated here? Any help is appreciated.

EDIT

Link to the dataset - Dataset

Thanks

dper
  • can you post the std of classes – Swarathesh Addanki Nov 29 '19 at 22:38
  • @SwaratheshAddanki I added the link to the dataset in the question... you can take a look. – dper Dec 02 '19 at 06:23
  • You could try a classical machine learning algorithm with "home made" features. For example, you could train a perceptron, an SVM or a Random Forest to predict a single day from the past 7 days (make one row with the 4*7 features). You could also take into account the same day of the previous week (Wednesday if you want to predict for Wednesday) and an average over every Wednesday of the last month. Also use cross validation to get a more realistic performance measurement – politinsa Dec 05 '19 at 02:20
  • @politinsa Could you share an example for the same? – dper Dec 05 '19 at 21:54
  • I believe that you don't have enough data to fit a good model: the main feature seems to be the downward jumps at the end of each month. We can only see two of these jumps in the data set, and from just two observations it will not be possible to learn much about what a typical jump looks like. Similarly, the growth during the months looks regular enough that the model could try to describe the shape of these curves, but there is little information about how much the values will grow over a typical month. Given this, "next month equals previous month" might be a good enough model? – jochen Feb 16 '20 at 11:04
  • @jochen Thank you for your reply. As I understand your comment, you mean that I need to provide more data points (probably a year) to get good results from the algorithm? – dper Feb 16 '20 at 18:42
  • @dper Yes, that's what I think. Once you have more data, you can try to find a model (this will still take some work), but from just 3 months I don't think there is a chance to do anything clever. If you have more data, maybe a good start would be to try to just model the downward jumps at the end of every month? Once you have this, maybe the increasing bits are not so difficult to do? – jochen Feb 16 '20 at 19:41
  • @jochen I am kinda new to modelling and I wanted to ask as to how to model the downward jump? Can you suggest some links? – dper Feb 16 '20 at 19:42
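The lagged-feature idea from the comments above (one row built from the past 7 days of all 4 variables) can be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, and `make_lagged_features` is a hypothetical helper name, not something from the thread or from a library.

```python
import numpy as np

def make_lagged_features(values, n_lags=7):
    """Turn a (n_days, n_vars) array into a supervised-learning dataset.

    Each row of X holds the previous `n_lags` days of all variables,
    flattened with the oldest day first; the matching row of y holds
    the values of the day that follows.
    """
    values = np.asarray(values, dtype=float)
    n_days = values.shape[0]
    X, y = [], []
    for t in range(n_lags, n_days):
        X.append(values[t - n_lags:t].ravel())  # past n_lags days, flattened
        y.append(values[t])                     # the day to predict
    return np.array(X), np.array(y)

# Synthetic stand-in for about 3 months of daily data with 4 columns
# (assumption: the real dataset would be loaded from the CSV instead).
rng = np.random.default_rng(0)
data = rng.normal(size=(90, 4)).cumsum(axis=0)

X, y = make_lagged_features(data, n_lags=7)
print(X.shape, y.shape)  # (83, 28) (83, 4)
```

`X` and `y` could then be passed to any regressor. For the cross validation mentioned in the comment, a time-ordered split (for example scikit-learn's `TimeSeriesSplit`) is probably safer than a shuffled one, since shuffling would let the model see the future.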

1 Answer


One way to improve your accuracy is to look at the autocorrelation of each variable, as suggested in the VAR documentation page:

https://www.statsmodels.org/dev/vector_ar.html

The larger the autocorrelation at a specific lag, the more useful that lag will be to the model.

Another good idea is to look at the AIC and BIC criteria to compare candidate models (the same link above has a usage example). Smaller values indicate a better trade-off between goodness of fit and model complexity.

This way, you can vary the order of your autoregressive model and pick the one that gives the lowest AIC and BIC, considering both together. If AIC indicates that the best model has a lag of 3 and BIC indicates a lag of 5, you should try lags 3, 4 and 5 and see which gives the best results.

The best scenario would be to have more data (3 months is not much), but you can try these approaches to see if they help.