
I'm trying to train an LSTM to reverse an integer sequence. My approach is a modified version of this tutorial, in which he just echoes the input sequence. It goes like this:

  1. Generate a random sequence S with length R (possible values range from 0 to 99)
  2. Break the sequence above into sub-sequences of length L (moving window)
  3. Each sub-sequence has its reverse as the ground-truth label

So, this will generate (R - L + 1) sub-sequences, which gives an input matrix of shape (R - L + 1) x L. For example, using:

S = 1 2 3 4 5 ... 25 (1 to 25)
R = 25
L = 5 

We end up with 21 sub-sequences:

s1 = 1 2 3 4 5, y1 = 5 4 3 2 1
s2 = 2 3 4 5 6, y2 = 6 5 4 3 2
...
s21 = 21 22 23 24 25, y21 = 25 24 23 22 21

This input matrix is then one-hot encoded and fed to Keras. Then I repeat the process for another sequence. The problem is that it does not converge; the accuracy is very low. What am I doing wrong?

In the code below I use R = 500 and L = 5, which gives 496 sub-sequences, with batch_size = 16 (so we have 31 updates per 'training session').

Here's the code:

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import LSTM
from random import randint
from keras.utils.np_utils import to_categorical
import numpy as np

# Encode each value of a sequence as a one-hot vector of length n_unique
def one_hot_encode(sequence, n_unique=100):
    encoding = list()
    for value in sequence:
        vector = [0 for _ in range(n_unique)]
        vector[value] = 1
        encoding.append(vector)
    return np.array(encoding)

# Recover the integer sequence from its one-hot encoding
def one_hot_decode(encoded_seq):
    return [np.argmax(vector) for vector in encoded_seq]

# Build the (rows - length + 1) sliding windows and their reversed labels
def get_data(rows=500, length=5, n_unique=100):
    s = [randint(0, n_unique-1) for i in range(rows)]
    x = []
    y = []

    for i in range(0, rows-length + 1, 1):
        x.append(one_hot_encode(s[i:i+length], n_unique))
        y.append(one_hot_encode(list(reversed(s[i:i+length])), n_unique))

    return np.array(x), np.array(y)

N = 50000          # number of training iterations (a fresh random sequence each time)
LEN = 5            # window length L
TIMESTEPS = LEN
FEATS = 10         # number of distinct values (width of the one-hot vectors)
BATCH_SIZE = 16    # 496 sub-sequences / 16 = 31 updates per iteration

# fit model
model = Sequential()
model.add(LSTM(100, batch_input_shape=(BATCH_SIZE, TIMESTEPS, FEATS), return_sequences=True, stateful=True))
model.add(TimeDistributed(Dense(FEATS, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

print(model.summary())

# train LSTM
for epoch in range(N):
    # generate new random sequence
    X,y = get_data(500, LEN, FEATS)
    # fit model for one epoch on this sequence
    model.fit(X, y, epochs=1, batch_size=BATCH_SIZE, verbose=2, shuffle=False)
    model.reset_states()

# evaluate LSTM 
X,y = get_data(500, LEN, FEATS)
yhat = model.predict(X, batch_size=BATCH_SIZE, verbose=0)

# decode all pairs
for i in range(len(X)):
    print('Expected:', one_hot_decode(y[i]), 'Predicted', one_hot_decode(yhat[i]))

Thanks!

Edit: It seems like the last numbers of each sequence are being picked up correctly:

Expected: [7, 3, 7, 7, 6] Predicted [3, 9, 7, 7, 6]
Expected: [6, 7, 3, 7, 7] Predicted [4, 6, 3, 7, 7]
Expected: [6, 6, 7, 3, 7] Predicted [4, 3, 7, 3, 7]
Expected: [1, 6, 6, 7, 3] Predicted [3, 3, 6, 7, 3]
Expected: [8, 1, 6, 6, 7] Predicted [4, 3, 6, 6, 7]
Expected: [8, 8, 1, 6, 6] Predicted [3, 3, 1, 6, 6]
Expected: [9, 8, 8, 1, 6] Predicted [3, 9, 8, 1, 6]
Expected: [5, 9, 8, 8, 1] Predicted [3, 3, 8, 8, 1]
Expected: [9, 5, 9, 8, 8] Predicted [7, 7, 9, 8, 8]
Expected: [0, 9, 5, 9, 8] Predicted [7, 9, 5, 9, 8]
Expected: [7, 0, 9, 5, 9] Predicted [5, 7, 9, 5, 9]
Expected: [1, 7, 0, 9, 5] Predicted [7, 9, 0, 9, 5]
Expected: [9, 1, 7, 0, 9] Predicted [5, 9, 7, 0, 9]
Expected: [4, 9, 1, 7, 0] Predicted [6, 3, 1, 7, 0]
Expected: [4, 4, 9, 1, 7] Predicted [4, 3, 9, 1, 7]
Expected: [0, 4, 4, 9, 1] Predicted [3, 9, 4, 9, 1]
Expected: [1, 0, 4, 4, 9] Predicted [5, 5, 4, 4, 9]
Expected: [3, 1, 0, 4, 4] Predicted [3, 3, 0, 4, 4]
Expected: [0, 3, 1, 0, 4] Predicted [3, 3, 1, 0, 4]
Expected: [2, 0, 3, 1, 0] Predicted [6, 3, 3, 1, 0]
Fernando
  • I don't think the LSTM layer has that capability. At least not alone. But... why train an LSTM for that? – Daniel Möller Sep 20 '17 at 14:47
  • It's just an example to explore LSTMs. Do you think it can't learn how to reverse a sequence? – Fernando Sep 20 '17 at 14:56
  • Yes... unless I understand it wrong, I think it can't. Reason: you want the first step of the result to be based on the last number, which it will only see at the last step. So, the first number in the output sequence is totally blind to the last. The last one can be influenced by the first, but I don't believe the first will ever be influenced by the last. And since the sequences are random, there isn't a possible relationship to be found between the numbers. – Daniel Möller Sep 20 '17 at 15:00
  • Also, you're using `stateful=True`. This is not compatible with a sliding-window case. With `stateful=True`, the sequences in the second batch are exact continuations of the sequences in the first batch. But you've got overlapping parts of sequences. `stateful=True` is only necessary if you have very long sequences that you don't want to pass at once and have to divide into smaller - but continuous - ones. – Daniel Möller Sep 20 '17 at 15:02
  • I increased the number of samples and used `stateful=False`, and the accuracy goes to 65%. Weird. – Fernando Sep 20 '17 at 15:10
  • If you `predict` some results, can you confirm that it's doing well on the last numbers of each sequence and badly on the first ones? (This would probably confirm what I've said before.) – Daniel Möller Sep 20 '17 at 15:12
  • Ha! Nailed it! --- That confirms the explanation. The first step of the result can't see the last numbers (they will be only seen at the last steps). Now the last steps of the result have already seen everything and can make sense of what you want. – Daniel Möller Sep 20 '17 at 15:25
  • In other kinds of problems, if you need that the last steps influence the first ones, you can wrap the LSTM layer in a `Bidirectional(LSTM(....))` wrapper. – Daniel Möller Sep 20 '17 at 15:26
  • Yep, it seems you're right! Could you post an answer with some visuals? I need to read more on this to fully understand it. – Fernando Sep 20 '17 at 15:27

1 Answer


The first thing that may be causing problems for your model is using `stateful=True`.

This option is only necessary when you want to divide one sequence into many parts, so that the sequences in the second batch are the continuation of the sequences in the first batch. This is useful when your sequence is long enough to cause memory issues and you have to divide it.

It also demands that you "erase memory" (called "resetting states") manually after you pass the last batch of a sequence.
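
For reference, here is a minimal sketch of the case where `stateful=True` does make sense: one long sequence cut into consecutive, non-overlapping chunks. The shapes, the dummy data and the variable names are placeholders, just for illustration.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

CHUNK, FEATS = 5, 10

# Stateful model: fixed batch size of 1, state carried across training calls
stateful_model = Sequential()
stateful_model.add(LSTM(100, batch_input_shape=(1, CHUNK, FEATS),
                        return_sequences=True, stateful=True))
stateful_model.add(TimeDistributed(Dense(FEATS, activation='softmax')))
stateful_model.compile(loss='categorical_crossentropy', optimizer='adam')

# One long one-hot sequence of 20 steps, cut into 4 consecutive chunks of 5
long_x = np.eye(FEATS)[np.random.randint(0, FEATS, size=20)]
long_y = long_x.copy()                      # dummy targets, only for illustration
for start in range(0, 20, CHUNK):           # each chunk continues the previous one
    xb = long_x[start:start + CHUNK][np.newaxis]
    yb = long_y[start:start + CHUNK][np.newaxis]
    stateful_model.train_on_batch(xb, yb)   # the LSTM state is kept between these calls
stateful_model.reset_states()               # erase memory once the long sequence ends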


Now, LSTM layers are not good for that task, because they work in the following manner:

  • Start with a clear "state" (which can be roughly interpreted as a clear memory). It's a whole new sequence coming in;
  • Take the first step/element in the sequence, calculate the first result and update memory;
  • Take the second step in the sequence, calculate the second result (now with help from its own memory and the previous results) and update memory;
  • And so on until the last step (a toy sketch of this per-step loop follows below).
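
To make that ordering concrete, here is a toy recurrence (not a real LSTM, and the weights are made up) showing that the output at step t can only depend on the inputs up to step t:

import numpy as np

# Toy recurrent loop: the state only accumulates inputs that have already been seen
x = np.array([7, 3, 7, 7, 6])            # one input window
h = 0.0                                  # the state starts empty at the first step
outputs = []
for x_t in x:
    h = np.tanh(0.5 * x_t + 0.5 * h)     # state updated from past inputs only
    outputs.append(h)                    # the output at step t never sees the later inputs,
                                         # so the first output cannot know the last number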

You can see a better explanation here. That explanation focuses on more exact details, but it has nice pictures such as this one:

LSTM steps illustration

Imagine a sequence with 3 elements. In this picture, X(t-1) is the first element. H(t-1) is the first result. X(t) and H(t) are the second input and output. X(t+1) and H(t+1) are the last input and output. They're processed in sequence.

So, the memory/state of the layer simply doesn't exist at the first step. As it receives the first number, it can't have the slightest idea of what the last number is, because it has never seen the last number. (Maybe if the sequences were somehow understandable, if the numbers had a relation between themselves, then there would be a better chance for it to output logical results, but these sequences are just random).

Now, as you approach the last number, the layer has already built its memory of the sequence, and there is a chance that it knows what to do (because it has already seen the first numbers).

That explains your results:

  • Up to the first half of the sequence, it's trying to output numbers that it has never seen (and that don't have any logical relation between themselves).
  • From the central number towards the end, all the relevant numbers have already been seen and can be predicted (you can verify this with the per-position check below).
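
You can check this directly by measuring the accuracy per output position, reusing `X`, `y`, `yhat` and `one_hot_decode` from the evaluation code in the question:

import numpy as np

# Per-position accuracy: expect poor scores at the first positions of each window
# and much better ones towards the last positions
true = np.array([one_hot_decode(seq) for seq in y])
pred = np.array([one_hot_decode(seq) for seq in yhat])
for pos in range(true.shape[1]):
    print('position', pos, 'accuracy', np.mean(true[:, pos] == pred[:, pos]))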

The Bidirectional layer wrapper:

If it's important that the first steps be influenced by the last ones, you will probably need the Bidirectional layer wrapper. It makes your LSTM process the sequence in both directions and doubles the number of output features. If you pass 100 cells, it will output (Batch, Steps, 200). It's pretty much like having two LSTM layers, one of them reading the inputs backwards.

model.add(Bidirectional(LSTM(100, return_sequences=True), input_shape=(TIMESTEPS, FEATS)))
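
For completeness, here is a sketch of how the whole model from the question could look with this wrapper (stateless, with TIMESTEPS = 5 and FEATS = 10 as in the question; the 100 units are just the same arbitrary size used there):

from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense, Bidirectional

TIMESTEPS, FEATS = 5, 10

model = Sequential()
# Forward and backward LSTMs: every output step now sees the whole input window
model.add(Bidirectional(LSTM(100, return_sequences=True),
                        input_shape=(TIMESTEPS, FEATS)))
model.add(TimeDistributed(Dense(FEATS, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

# model.fit(X, y, epochs=1, batch_size=16)   # stateless: no reset_states() needed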
Daniel Möller