
I have a time series dataset containing data from a whole year (the date is the index). The data was measured every 15 minutes throughout the year, which results in 96 timesteps per day. The data is already normalized. The variables are correlated. All variables except VAR are weather measurements.

VAR is seasonal with a daily period and a weekly period (it looks a bit different on weekends, but more or less the same every weekend). The VAR values are stationary. I would like to predict the values of VAR for the next two days (192 steps ahead) and for the next seven days (672 steps ahead).

Here is the sample of the dataset:

DateIdx               VAR       dewpt       hum         press       temp
2017-04-17 00:00:00   0.369397  0.155039    0.386792    0.196721    0.238889
2017-04-17 00:15:00   0.363214  0.147287    0.429245    0.196721    0.233333
2017-04-17 00:30:00   0.357032  0.139535    0.471698    0.196721    0.227778
2017-04-17 00:45:00   0.323029  0.127907    0.429245    0.204918    0.219444
2017-04-17 01:00:00   0.347759  0.116279    0.386792    0.213115    0.211111
2017-04-17 01:15:00   0.346213  0.127907    0.476415    0.204918    0.169444
2017-04-17 01:30:00   0.259660  0.139535    0.566038    0.196721    0.127778
2017-04-17 01:45:00   0.205564  0.073643    0.523585    0.172131    0.091667
2017-04-17 02:00:00   0.157650  0.007752    0.481132    0.147541    0.055556
2017-04-17 02:15:00   0.122101  0.003876    0.476415    0.122951    0.091667

Input dataset plot

I have decided to use an LSTM in Keras. Having data from the whole year, I used the data from the first 329 days as training data and the rest for validation during training. train_X contains all the measurements (including VAR) from those 329 days; train_Y contains only VAR from the same 329 days, shifted one step ahead. The remaining timesteps go into test_X and test_Y.

Here is the code I use to prepare train_X and train_Y:

#X -> the whole dataframe (all five columns)
#Y -> the VAR column from the whole dataframe, already shifted 1 step ahead

#329 * 96 = 31584
train_X = X.values[:31584]
train_X = train_X.reshape(train_X.shape[0], 1, 5)
train_Y = Y[:31584]
train_Y = train_Y.reshape(train_Y.shape[0], 1)
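
test_X and test_Y are prepared in the same way from the remaining timesteps (roughly; mirroring the code above):

#the rest of the year, reshaped like the training data
test_X = X.values[31584:]
test_X = test_X.reshape(test_X.shape[0], 1, 5)
test_Y = Y[31584:]
test_Y = test_Y.reshape(test_Y.shape[0], 1)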

To predict the next VAR value I would like to use the past 672 timesteps (a whole week of measurements). For this reason I have set batch_size=672, so the ‘fit’ command looks like this:

history = model.fit(train_X, train_Y, epochs=50, batch_size=672, validation_data=(test_X, test_Y), shuffle=False)  

Here is the architecture of my network:

model = models.Sequential()
model.add(layers.LSTM(672, input_shape=(None, 672), return_sequences=True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(336, return_sequences=True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(168, return_sequences=True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(84, return_sequences=True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(21, return_sequences=False))
model.add(layers.Dense(1))
model.compile(loss='mae', optimizer='adam')
model.summary()

From the plot below we can see that the network has learned ‘something’ after 50 epochs:

Plot from the learning process

For prediction I have prepared a dataset containing the last 672 steps with all values, plus 96 steps without the VAR value – which should be predicted. I also used autoregression, so I updated VAR after each prediction and used it as input for the next prediction.

The predX dataset (used for prediction) looks like this:

print(predX['VAR'][668:677])

DateIdx            VAR
2017-04-23 23:00:00    0.307573
2017-04-23 23:15:00    0.278207
2017-04-23 23:30:00    0.284390
2017-04-23 23:45:00    0.309118
2017-04-24 00:00:00         NaN
2017-04-24 00:15:00         NaN
2017-04-24 00:30:00         NaN
2017-04-24 00:45:00         NaN
2017-04-24 01:00:00         NaN
Name: VAR, dtype: float64

Here is the (autoregressive) code I have used to predict the next 96 steps:

stepsAhead = 96
historySteps = 672

for i in range(0, stepsAhead):
    j = i + historySteps
    #predict from the last 672 rows (including the VAR values predicted so far)
    ypred = model.predict(predX.values[i:j].reshape(1, historySteps, 5))
    #write the prediction back so it is used as input for the next step
    predX['VAR'][j] = ypred

Unfortunately the results are very poor and far from my expectations:

A Plot with predicted data

Results combined with the previous day:

Predicted data combined with a previous day

Apart from the ‘What have I done wrong?’ question, I would like to ask a few more specific questions:

Q1. During model fitting, I have just put the whole history in batches of size 672. Is that correct? How should I organize the dataset for model fitting? What options do I have? Should I use the “sliding window” approach (as in the link here: https://machinelearningmastery.com/promise-recurrent-neural-networks-time-series-forecasting/ )?

Q2. Are 50 epochs enough? What is the common practice here? Maybe the network is underfitted, resulting in the poor predictions? So far I have tried 200 epochs with the same result.

Q3. Should I try a different architecture? Is the proposed network ‘big enough’ to handle such data? Maybe a “stateful” network is the right approach here?

Q4. Did I implement the autoregression correctly? Is there any other approach to making a prediction many steps ahead, e.g. 192 or 672 steps as in my case?

3 Answers


It looks like there is some confusion about how to organise the data to train an RNN. So let's cover the questions:

  1. Once you have a 2D dataset (total_samples, 5) you can use the TimeseriesGenerator to create a sliding window that will generate (batch_size, past_timesteps, 5) for you. In this case, you will use .fit_generator to train the network; see the sketch after this list.
  2. If you get the same result with more epochs, 50 epochs should be fine. You usually adjust this based on the performance of your network, but you should keep it fixed if you are comparing two different network architectures.
  3. The architecture is really large because you aim to predict all 672 future values at once. You can instead design the network so it learns to predict one measurement at a time. At prediction time you can predict one point and feed it back in to predict the next, until you get 672.
  4. This ties into answer 3: you can learn to predict one step at a time and chain the predictions out to n steps after training (see the chaining sketch below the model).
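
As a rough sketch of point 1 (past_timesteps and batch_size are just placeholders here; note the generator expects the raw, unshifted VAR column as targets, because it already pairs each window with the value one step after it):

from keras.preprocessing.sequence import TimeseriesGenerator

past_timesteps = 672   # one week of 15-minute steps
batch_size = 32

# data: array of shape (total_samples, 5); targets: the unshifted VAR column
train_gen = TimeseriesGenerator(data, targets, length=past_timesteps,
                                batch_size=batch_size)

# each batch yields x of shape (batch_size, past_timesteps, 5)
model.fit_generator(train_gen, epochs=50)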

The single point prediction model could look like:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(past_timesteps, 5)))
model.add(LSTM(64))
model.add(Dense(1))
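
And a rough sketch of the chaining idea from points 3 and 4 (the VAR column index and the handling of the weather columns are assumptions; if future weather measurements are available, as in the question's predX, they can be written into new_row instead of reusing the last known row):

import numpy as np

def forecast(model, window, n_steps, var_col=0):
    # window: array of shape (past_timesteps, 5) holding the latest observations
    window = window.copy()
    preds = []
    for _ in range(n_steps):
        yhat = model.predict(window[np.newaxis, :, :])[0, 0]
        preds.append(yhat)
        # drop the oldest row, append a new one that reuses the last known
        # weather values and the freshly predicted VAR
        new_row = window[-1].copy()
        new_row[var_col] = yhat
        window = np.vstack([window[1:], new_row])
    return np.array(preds)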
nuric
  • You are right. I have used the `TimeseriesGenerator` together with a simpler network and the predictions started to look better. Thank you. Anyway, I still have a doubt about the `steps_per_epoch` parameter of `fit_generator`. I have used the equation `stepspe = len(train_X) / batchsize` (where batchsize = 32). Is this a correct approach? – Bolesław Maliszewski Jun 11 '18 at 12:56
  • You don't need to specify that if you are using TimeseriesGenerator, it automatically calculates it for you using the equation you have. – nuric Jun 11 '18 at 12:58
  • @nuric I'm trying to solve similar problem. I have input with 3 features: `a`, `b` and `c` and want to predict `a` one step ahead. predict function returns output with one feature so how can I feed it again to predict next values if my input consists of 3 features? – mikro098 Jun 09 '20 at 21:21

1) Batches are not the sequences. The input X holds the sequences. The input should have the shape [None, sequence_length, number_of_features]. The 1st axis will be filled in by Keras with the batches, but batches are not sequences. The sequences are on the 2nd axis. The 3rd axis holds the feature columns. A batch size of 672 might be too large; you can try smaller values such as 128, 64, or 32.
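
For instance, a small sketch of how a (total_samples, 5) array could be turned into that shape with plain NumPy (the window length and the VAR column index are only examples):

import numpy as np

sequence_length = 96                       # e.g. one day of 15-minute steps
values = X.values                          # (total_samples, 5)

# axis 0 = samples (split into batches by Keras), axis 1 = timesteps,
# axis 2 = feature columns
windows = np.stack([values[i:i + sequence_length]
                    for i in range(len(values) - sequence_length)])
targets = values[sequence_length:, 0]      # next-step VAR, assuming column 0

print(windows.shape)                       # (num_windows, sequence_length, 5)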

2) It is almost certain that your network overfits. The network has too many LSTM layers. I would try just 2 LSTM layers, as @nuric suggested, and see how it performs.

3) There also seems to be some confusion about the number of LSTM units (the layer size). It does not have to be 672; in fact, 672 is too large. A good starting point is 128.

4) The NN architecture is predicting a single value of VAR. In that case, make sure your Y has a single value for each sequence of X.

5) Alternatively, you can make the last LSTM output a sequence. In that case, each Y entry is a VAR sequence shifted one step ahead. Going back to 4), make sure Y has the shape corresponding to that of X and the NN architecture; see the sketch below.
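
A minimal sketch of that sequence-to-sequence variant (layer sizes are only illustrative):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(sequence_length, 5)))
model.add(LSTM(64, return_sequences=True))   # last LSTM outputs a sequence
model.add(TimeDistributed(Dense(1)))         # one VAR value per timestep
model.compile(loss='mae', optimizer='adam')

# Y must then have shape (num_windows, sequence_length, 1): each entry is the
# VAR sequence shifted one step ahead of its X window.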

6) Your plot shows that 50 epochs are enough for convergence. Once you adjust X, Y, and the NN, watch the learning curves again to settle on the number of epochs.

7) Lastly, an idea about the dates. If you want to include the dates in X, one idea is to one-hot encode them into weekdays. So your X would be [dewpt, hum, press, temp, MON, TUE, ..., SAT, SUN].
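
For example, a sketch of that encoding with pandas (assuming the DateIdx index from the question; the wd_ column names are arbitrary):

import pandas as pd

# 0 = Monday ... 6 = Sunday, expanded into 7 binary columns
weekday = pd.get_dummies(pd.Series(X.index.dayofweek, index=X.index),
                         prefix='wd')
X_encoded = pd.concat([X[['dewpt', 'hum', 'press', 'temp']], weekday], axis=1)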

neurite

Your main issue here, as stated by others, is the size of your network. LSTMs are great for learning long-term dependencies, but they're certainly not magic. Personally, I haven't had much success with sequences of 100+ timesteps. What you will find is that you end up suffering from the 'exploding/vanishing gradients problem' because your network is too large.

I won't reiterate what others have said about reshaping your data into the proper format, but once you have done that I recommend starting small (10-15 steps) and predicting just the next step, then building up from there. That's not to say you can't eventually predict a much longer sequence farther into the future, but starting small will help you understand how the RNN is behaving before you scale it up.

LucyMLi