Extremely poor prediction: LSTM time-series

Question

I tried to implement an LSTM model for time-series prediction. My trial code is below; it runs without errors. You can also run it yourself without additional dependencies.

import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, Bidirectional
from sklearn.metrics import mean_squared_error, accuracy_score
from scipy.stats import linregress
from sklearn.utils import shuffle

fi = 'pollution.csv'
raw = pd.read_csv(fi, delimiter=',')
raw = raw.drop('Dates', axis=1)
print (raw.shape)

scaler = MinMaxScaler(feature_range=(-1, 1))
raw = scaler.fit_transform(raw)

time_steps = 7
def create_ds(data, t_steps):
    data = pd.DataFrame(data)
    data_s = data.copy()
    for i in range(t_steps):  # use the function argument instead of the global
        data = pd.concat([data, data_s.shift(-(i+1))], axis = 1)   
    data.dropna(axis=0, inplace=True)
    return data.values

ds = create_ds(raw, time_steps)
print (ds.shape)
n_feats = raw.shape[1]
n_obs = time_steps * n_feats

n_rows = ds.shape[0]
train_size = int(n_rows * 0.8)

train_data = ds[:train_size, :]
train_data = shuffle(train_data)

test_data = ds[train_size:, :]

x_train = train_data[:, :n_obs]
y_train = train_data[:, n_obs:]
x_test = test_data[:, :n_obs]
y_test = test_data[:, n_obs:]

x_train = x_train.reshape(1, x_train.shape[0], x_train.shape[1])
y_train = y_train.reshape(1, y_train.shape[0], y_train.shape[1])
x_test = x_test.reshape(1, x_test.shape[0], x_test.shape[1])

print (x_train.shape)
print (y_train.shape)
print (x_test.shape)
print (y_test.shape)

model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(None, x_train.shape[2]), stateful=True, batch_size=1))
model.add(LSTM(32, return_sequences=True, stateful=True))
model.add(LSTM(n_feats, return_sequences=True, stateful=True)) 

model.compile(loss='mse', optimizer='rmsprop')
model.fit(x_train, y_train, epochs=10, batch_size=1, verbose=2)  
y_predict = model.predict(x_test)
y_predict = y_predict.reshape(y_predict.shape[1], y_predict.shape[2])

y_predict = scaler.inverse_transform(y_predict)

y_test = scaler.inverse_transform(y_test)
y_test = y_test[:,0]
y_predict = y_predict[:,0]

print (y_test.shape)
print (y_predict.shape)

plt.plot(y_test, label='True')
plt.plot(y_predict,  label='Predict')
plt.legend()
plt.show()

However, the prediction is extremely poor. How can I improve it?

Any ideas for improving the prediction by redesigning the architecture and/or layers?

The data looks pretty random. Perhaps this is the best that the LSTM can do without overfitting. A good rule of thumb is that if you can't predict the data yourself, you shouldn't expect a neural network to be able to do it. — Primusa, Apr 27 '18 at 02:50
The prediction seems quite good, actually... unless there is some rule about the period of the oscillations, then you could capture that period with a more powerful model. But if the period doesn't follow any pattern, then this is a good prediction. — Daniel Möller, Apr 27 '18 at 07:36

Daniel Möller · Accepted Answer · 2018-04-27T18:04:22.847

If you want to use the model in my code (the link you passed), the data must be shaped correctly: (1 sequence, total_time_steps, 5 features).

Important: I don't know if this is the best way or the best model to do this, but this model predicts 7 time steps ahead of the input (time_shift=7).
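To make the time_shift pairing concrete, here is a minimal sketch on a toy array. The array `toy` and the helper `make_shifted_pairs` are illustrative names only, not part of the original code. It assumes nothing beyond NumPy and shows that step t of the input lines up with step t + time_shift of the target.

```python
import numpy as np

def make_shifted_pairs(series, time_shift):
    """Return (x, y), each shaped (1, steps, features), where y[:, t] is series[t + time_shift]."""
    x = series[:-time_shift, :]   # drop the last `time_shift` steps
    y = series[time_shift:, :]    # drop the first `time_shift` steps
    # Keras LSTMs expect 3D data: (sequences, time steps, features)
    return x[np.newaxis, ...], y[np.newaxis, ...]

# Toy data (hypothetical): 20 time steps, 5 features; each value encodes its step index
toy = np.repeat(np.arange(20, dtype=float)[:, None], 5, axis=1)
x, y = make_shifted_pairs(toy, time_shift=7)

print(x.shape, y.shape)        # (1, 13, 5) (1, 13, 5)
print(x[0, 0, 0], y[0, 0, 0])  # 0.0 7.0
```

Every target value is exactly 7 steps ahead of the input at the same position, which is the alignment the slicing in the next section produces.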

Data and initial vars

fi = 'pollution.csv'
raw = pd.read_csv(fi, delimiter=',')
raw = raw.drop('Dates', axis=1)
print("raw shape:")
print (raw.shape)
#(1789,5) - 1789 time steps / 5 features

scaler = MinMaxScaler(feature_range=(-1, 1))
raw = scaler.fit_transform(raw)

time_shift = 7 #shift is the number of steps we are predicting ahead
n_rows = raw.shape[0] #n_rows is the number of time steps of our sequence
n_feats = raw.shape[1]
train_size = int(n_rows * 0.8)


#I couldn't understand how "ds" worked, so I simply removed it because in the code below it's not necessary

#getting the train part of the sequence
train_data = raw[:train_size, :] #first train_size steps, all 5 features
test_data = raw[train_size:, :] #I'll use the beginning of the data as state adjuster


#train_data = shuffle(train_data) !!!!!! we cannot shuffle time steps!!! we lose the sequence doing this

x_train = train_data[:-time_shift, :] #the entire train data, except the last shift steps 
x_test = test_data[:-time_shift,:] #the entire test data, except the last shift steps
x_predict = raw[:-time_shift,:] #the entire raw data, except the last shift steps

y_train = train_data[time_shift:, :] 
y_test = test_data[time_shift:,:]
y_predict_true = raw[time_shift:,:]

x_train = x_train.reshape(1, x_train.shape[0], x_train.shape[1]) #ok shape (1,steps,5) - 1 sequence, many steps, 5 features
y_train = y_train.reshape(1, y_train.shape[0], y_train.shape[1])
x_test = x_test.reshape(1, x_test.shape[0], x_test.shape[1])
y_test = y_test.reshape(1, y_test.shape[0], y_test.shape[1])
x_predict = x_predict.reshape(1, x_predict.shape[0], x_predict.shape[1])
y_predict_true = y_predict_true.reshape(1, y_predict_true.shape[0], y_predict_true.shape[1])

print("\nx_train:")
print (x_train.shape)
print("y_train")
print (y_train.shape)
print("x_test")
print (x_test.shape)
print("y_test")
print (y_test.shape)

Model

Your model wasn't very powerful for this task, so I tried a bigger one (this one, on the other hand, is too powerful).

model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(None, x_train.shape[2])))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(n_feats, return_sequences=True)) 

model.compile(loss='mse', optimizer='adam')

Fitting

Notice that I had to train 2000+ epochs for the model to have good results.
I added the validation data so we can compare the loss for train and test.

#notice that I'm predicting from the ENTIRE sequence, including x_train;
#this is important for the model to adjust its states before predicting the end
model.fit(x_train, y_train, epochs=1000, batch_size=1, verbose=2, validation_data=(x_test,y_test))

Predicting

Important: to predict the end of a sequence from its beginning, the model must see the beginning so it can adjust its internal states. That is why I'm predicting on the entire data (x_predict), not only the test data.
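One way to still score only the unseen part is to predict the whole sequence and then slice off the tail. The sketch below uses random stand-in arrays instead of real model output. `tail_mse` is a hypothetical helper, and the sizes follow the raw shape noted in the data section (1789 steps, 5 features).

```python
import numpy as np

def tail_mse(y_true, y_pred, tail_steps):
    """Mean squared error over the last `tail_steps` time steps of (1, steps, features) arrays."""
    diff = y_true[:, -tail_steps:, :] - y_pred[:, -tail_steps:, :]
    return float(np.mean(diff ** 2))

n_rows, n_feats, time_shift = 1789, 5, 7
train_size = int(n_rows * 0.8)
test_size = n_rows - train_size

rng = np.random.default_rng(0)
# Stand-ins (assumption) for y_predict_true and model.predict(x_predict)
y_true = rng.uniform(-1, 1, size=(1, n_rows - time_shift, n_feats))
y_pred = y_true + rng.normal(0, 0.1, size=y_true.shape)

print("test steps:", test_size)
print("test-tail MSE:", tail_mse(y_true, y_pred, test_size))
```

The model still consumes the training portion to warm up its states, but the reported number reflects only the final test_size steps.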

y_predict_model = model.predict(x_predict)

print("\ny_predict_true:")
print (y_predict_true.shape)
print("y_predict_model: ")
print (y_predict_model.shape)


def plot(true, predicted, divider):

    predict_plot = scaler.inverse_transform(predicted[0])
    true_plot = scaler.inverse_transform(true[0])

    predict_plot = predict_plot[:,0]
    true_plot = true_plot[:,0]

    plt.figure(figsize=(16,6))
    plt.plot(true_plot, label='True',linewidth=5)
    plt.plot(predict_plot,  label='Predict',color='y')

    if divider > 0:
        maxVal = max(true_plot.max(),predict_plot.max())
        minVal = min(true_plot.min(),predict_plot.min())

        plt.plot([divider,divider],[minVal,maxVal],label='train/test limit',color='k')

    plt.legend()
    plt.show()

test_size = n_rows - train_size
print("test length: " + str(test_size))

plot(y_predict_true,y_predict_model,train_size)
plot(y_predict_true[:,-2*test_size:],y_predict_model[:,-2*test_size:],test_size)

Showing entire data

Showing the end portion of it for more detail

Please note that this model is overfitting: it learns the training data well but gets bad results on the test data.

To solve this, you must experiment with smaller models, dropout layers, and other techniques that prevent overfitting.

Notice also that this data very probably contains A LOT of random factors, meaning the models will not be able to learn anything useful from it. As you make smaller models to avoid overfitting, you may also find that the model will present worse predictions for training data.

Finding the perfect model is not an easy task, it's an open question and you must experiment. Maybe LSTM models simply aren't the solution. Maybe your data is simply not predictable, etc. There isn't a definitive answer for this.

How to know the model is good

With the validation data in training, you can compare loss for train and test data.

Train on 1 samples, validate on 1 samples
Epoch 1/1000
9s - loss: 0.4040 - val_loss: 0.3348
Epoch 2/1000
4s - loss: 0.3332 - val_loss: 0.2651
Epoch 3/1000
4s - loss: 0.2656 - val_loss: 0.2035
Epoch 4/1000
4s - loss: 0.2061 - val_loss: 0.1696
Epoch 5/1000
4s - loss: 0.1761 - val_loss: 0.1601
Epoch 6/1000
4s - loss: 0.1697 - val_loss: 0.1476
Epoch 7/1000
4s - loss: 0.1536 - val_loss: 0.1287
Epoch 8/1000
.....

Both should go down together. When the validation loss (val_loss) stops going down but the training loss continues to improve, your model is starting to overfit.
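This check can be automated. Below is a minimal, framework-free sketch; the helper `best_epoch` and the `patience` value are assumptions for illustration. The first seven validation losses come from the log above, and the last three are made up to show what a rising validation loss looks like.

```python
def best_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) as 1-based epoch numbers.

    Training is considered finished once the validation loss has not
    improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best = loss, epoch
        elif epoch - best >= patience:
            return epoch, best
    return len(val_losses), best

# Epochs 1-7: val_loss values from the training log; epochs 8-10: invented values
val_losses = [0.3348, 0.2651, 0.2035, 0.1696, 0.1601, 0.1476, 0.1287,
              0.1300, 0.1350, 0.1400]
stop, best = best_epoch(val_losses, patience=3)
print(stop, best)  # 10 7
```

Keras also ships an `EarlyStopping` callback that applies the same idea during `fit` by monitoring `val_loss`.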

Trying another model

The best I could do (but I didn't really try much) was using this model:

model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(None, x_train.shape[2])))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(n_feats, return_sequences=True)) 

model.compile(loss='mse', optimizer='adam')

When the losses were about:

loss: 0.0389 - val_loss: 0.0437

After this point, the validation loss started going up (so training beyond this point is totally useless)

Result:

This shows that all this model could learn was the overall behaviour, such as zones with higher values.

But the high-frequency variation was either too random or the model wasn't good enough to capture it...

What the `create_ds` does is, it uses all 7 variables (`t-7`, `t-6`, `t-5`, `t-4`, `t-3`, `t-2`, `t-1`) for 5 features. So, 7*5=35 total features are fed into X (train_x or test_x), while 5 features are fed into Y (train_y or test_y). In your answer, you are using only `t-7` variables as X. Could you somehow adjust 35 features in your answer? — , Apr 27 '18 at 21:58
@hiker, I'm afraid I can't. I'm not really a machine learning expert, you know.... I only "use keras well". — Daniel Möller, Apr 28 '18 at 16:35
@DanielMöller I think, one biggest source of bias in this program, that is `raw = scaler.fit_transform(raw)`. It scales both the train and test data together which makes bias in the prediction. How do you think? — Roman, May 03 '18 at 10:32
I think this data is quite random and there isn't much that can be done. — Daniel Möller, May 03 '18 at 12:15
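For reference, the 35-feature layout described in the first comment corresponds to sliding windows of 7 steps over 5 features. A minimal NumPy sketch follows. The helper `make_windows` is hypothetical and is not the original `create_ds`, and the toy array stands in for the scaled data.

```python
import numpy as np

def make_windows(series, time_steps):
    """Split (n_rows, n_feats) data into x: (samples, time_steps, n_feats) and y: (samples, n_feats)."""
    n_samples = series.shape[0] - time_steps
    # Each sample is a block of `time_steps` consecutive rows
    x = np.stack([series[i:i + time_steps] for i in range(n_samples)])
    # The target is the row that immediately follows each block
    y = series[time_steps:]
    return x, y

toy = np.arange(50, dtype=float).reshape(10, 5)  # 10 steps, 5 features
x, y = make_windows(toy, time_steps=7)

print(x.shape, y.shape)              # (3, 7, 5) (3, 5)
print(x.reshape(len(x), -1).shape)   # (3, 35)
```

Flattening each window gives the 7*5=35 input columns, while the 3D form (samples, time-steps, features) is what a windowed LSTM would take as input.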

score 4 · Answer 2 · 2018-04-28T08:00:28.237

You may consider changing your model:

import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, Bidirectional
from sklearn.metrics import mean_squared_error, accuracy_score
from scipy.stats import linregress
from sklearn.utils import shuffle

fi = 'pollution.csv'
raw = pd.read_csv(fi, delimiter=',')
raw = raw.drop('Dates', axis=1)
print (raw.shape)

scaler = MinMaxScaler(feature_range=(-1, 1))
raw = scaler.fit_transform(raw)

time_steps = 7
def create_ds(data, t_steps):
    data = pd.DataFrame(data)
    data_s = data.copy()
    for i in range(t_steps):  # use the function argument instead of the global
        data = pd.concat([data, data_s.shift(-(i+1))], axis = 1)   
    data.dropna(axis=0, inplace=True)
    return data.values

ds = create_ds(raw, time_steps)
print (ds.shape)
n_feats = raw.shape[1]
n_obs = time_steps * n_feats

n_rows = ds.shape[0]
train_size = int(n_rows * 0.8)

train_data = ds[:train_size, :]
train_data = shuffle(train_data)

test_data = ds[train_size:, :]

x_train = train_data[:, :n_obs]
y_train = train_data[:, n_obs:]
x_test = test_data[:, :n_obs]
y_test = test_data[:, n_obs:]

print (x_train.shape)
print (x_test.shape)
print (y_train.shape)
print (y_test.shape)

x_train = x_train.reshape(x_train.shape[0], time_steps, n_feats)
x_test = x_test.reshape(x_test.shape[0], time_steps, n_feats)

print (x_train.shape)
print (x_test.shape)
print (y_train.shape)
print (y_test.shape)

model = Sequential()
model.add(LSTM(64, input_shape=(time_steps, n_feats), return_sequences=True))
model.add(LSTM(32, return_sequences=False))
model.add(Dense(n_feats))

model.compile(loss='mse', optimizer='rmsprop')
model.fit(x_train, y_train, epochs=10, batch_size=1, verbose=1, shuffle=False)

y_predict = model.predict(x_test)
print (y_predict.shape)
y_predict = scaler.inverse_transform(y_predict)

y_test = scaler.inverse_transform(y_test)
y_test = y_test[:,0]
y_predict = y_predict[:,0]

print (y_test.shape)
print (y_predict.shape)

plt.plot(y_test, label='True')
plt.plot(y_predict,  label='Predict')
plt.legend()
plt.show()

But I really do not know the merits of your implementation:

* both x and y are 3D (1, steps, features), rather than x in 3D (samples, time-steps, features) and y in 2D (samples, features)
* input_shape=(None, x_train.shape[2])
* last layer: model.add(LSTM(n_feats, return_sequences=True, stateful=True))

Someone may provide a better answer.

I followed the code from @Daniel Möller thinking it has merits. https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb — Roman, Apr 27 '18 at 05:00
@hiker, I'm taking a look at your code, and there are very important differences that make it not behave as in my code. 1 - x_train contains 35 features (it should contain only 5), 2 - it seems you're shuffling the data, so you lose the order of the steps, 3 - you're training a stateful=True model without resetting states (notice that in my code, the first model is not stateful, only the second - the purpose of the second model is to infinitely output 1 step and take this step as input, and I'm not training the second model) -- These differences certainly make everything different. — Daniel Möller, Apr 27 '18 at 11:48
Now, obviously, there aren't rules like "your data should be like mine", but your model must certainly be adjusted to your data. My model isn't. — Daniel Möller, Apr 27 '18 at 11:49
About x and y in 3D. This answer is also 3D (this is a keras rule, and it's impossible to train LSTMs with data that is not 3D). -- The `input_shape=(None,features)` means you can input any length in time steps. (You don't need exactly 7) -- Another difference: having length 7 implies that you're training small time windows, while my model is suited for training the whole sequence at once. --- Finally, about the LSTM instead of a Dense, this is a possibility in the model design (that can have whichever layers you want), which is better, I don't know, testing may answer. — Daniel Möller, Apr 27 '18 at 11:52

score 3 · Answer 3 · answered Jun 03 '19 at 17:07

Reading the original code, it seems the author first scales the dataset and then splits it up into Training and Testing subsets. This means that information about the Testing subset (e.g., volatility etc.) has "leaked" into the Training subset.

The recommended approach is to first perform the Training/Testing split up, calculate the scaling parameters using only the Training subset, and using these parameters perform the scaling of the Training and the Testing subsets separately.
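A minimal sketch of that order of operations, written with plain NumPy so that each step is visible. The helpers `fit_minmax` and `apply_minmax` and the random stand-in data are assumptions for illustration, and the (-1, 1) range mirrors the one used in the question.

```python
import numpy as np

def fit_minmax(train, feature_range=(-1.0, 1.0)):
    """Compute per-feature scaling parameters from the training rows only."""
    lo, hi = feature_range
    data_min = train.min(axis=0)
    data_max = train.max(axis=0)
    # Guard against constant columns to avoid division by zero
    span = np.where(data_max > data_min, data_max - data_min, 1.0)
    return data_min, (hi - lo) / span, lo

def apply_minmax(data, params):
    """Scale any subset with parameters that were fitted on the training subset."""
    data_min, scale, lo = params
    return (data - data_min) * scale + lo

rng = np.random.default_rng(0)
raw = rng.uniform(0, 100, size=(100, 5))   # stand-in for the pollution data
train_size = int(len(raw) * 0.8)
train, test = raw[:train_size], raw[train_size:]   # split first

params = fit_minmax(train)                  # statistics come from train only
train_scaled = apply_minmax(train, params)
test_scaled = apply_minmax(test, params)    # may fall slightly outside [-1, 1]

print(train_scaled.min(), train_scaled.max())
print(test_scaled.min(), test_scaled.max())
```

With scikit-learn's MinMaxScaler, the same idea is `scaler.fit(train)` followed by `scaler.transform(...)` on each subset, instead of calling `fit_transform` on the full dataset.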

score 2 · Answer 4 · answered Apr 27 '18 at 17:36

I’m not exactly sure what you could do, that data looks as if it has no discernible pattern. If I can’t see one I doubt an LSTM could. Your prediction does look like a good regression line though.

rmcwhorter99

Although this is a good general idea, and you can very probably be right, one of the great features of using neural networks is exactly to find patterns that perhaps our mind can't. --- That doesn't mean that every data has such pattern, though. – Daniel Möller Apr 27 '18 at 18:10

score 1 · Answer 5 · answered Apr 08 '21 at 09:26

I am working on a model that predicts data like this myself. I created a SMOTErnn solution to add as past data, and I have found that using TimeseriesGenerator with a higher batch_size and higher strides performs much better.

Marcus Rose
