
I'm building a machine translation model (English-French) using a sequence-to-sequence LSTM.

I've seen the Keras seq2seq-lstm example, but I couldn't understand how the data is prepared from the text. Below is the for loop used for preparing the data, and there are a few things in it I don't understand.

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.

Why do we need three different arrays: encoder_input_data, decoder_input_data and decoder_target_data?

for t, char in enumerate(target_text):
    decoder_input_data[i, t, target_token_index[char]] = 1.
    if t > 0:
        # decoder_target_data will be ahead by one timestep
        # and will not include the start character.
        decoder_target_data[i, t - 1, target_token_index[char]] = 1.
        # why is it t - 1? shouldn't it be t + 1?

Here it says the decoder target will be ahead by one timestep. What does that mean? If it is "ahead", wouldn't that be t + 1 rather than t - 1? I've also read that "we have to offset decoder_target_data by one timestep" - what does that mean here?

If possible, can you explain this for loop completely, and point out anything important I should keep in mind when preparing data for future seq2seq models? I mean, how do we prepare the data for the model? It's quite confusing.

user_12

1 Answer


OK, I assume you have read lines 11 through 34 of the example ("Summary of the Algorithm"), so you know the basic idea behind this particular sequence-to-sequence model: first the encoder produces 2 "state vectors" (a latent "something"), then that state is fed to the decoder, which generates the output sequence from it. Anyway, let's look at it step by step (lines 127-132):

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

There are two "states" for an LSTM; see https://keras.io/layers/recurrent/ under "Output shape". They are the internal state after processing the input sequence - or rather an array (row-wise) of states for all the sequences in the batch. The output produced is discarded. latent_dim is the number of LSTM cells (line 60: it's 256) - it also determines the sizes of the state vectors.
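
Here is a minimal sketch (mine, not from the tutorial) that just probes those shapes; latent_dim matches the example, num_encoder_tokens is an illustrative value:

import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

latent_dim = 256          # number of LSTM cells, as in the example
num_encoder_tokens = 71   # illustrative dictionary size

inp = Input(shape=(None, num_encoder_tokens))
out, state_h, state_c = LSTM(latent_dim, return_state=True)(inp)
probe = Model(inp, [out, state_h, state_c])

x = np.zeros((3, 15, num_encoder_tokens), dtype='float32')  # 3 sequences of 15 one-hot chars
o, h, c = probe.predict(x)
print(o.shape, h.shape, c.shape)  # (3, 256) (3, 256) (3, 256) - one state row per sequence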

Next:

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

First of all, note that this model is not Sequential; it uses the functional API: https://keras.io/models/model/ - its inputs are both the encoder inputs and the decoder inputs, and its output is the decoder output.

What is the size of the decoder output? num_decoder_tokens is the size of the dictionary (not the length of the output sequence!). At each timestep it should produce the probability distribution of the next character in the output sequence, given the "history" and the current input - and this "history" (the initial internal state) is the final state of the encoder after processing the input sequence.

Note - the decoder will be initialized with the final state of the encoder, and then, after sampling each character, the modified state will be used for the next inference step, along with a new "input" - a one-hot vector for the last predicted character.
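
As a condensed sketch of that sampling loop (the full version is the decode_sequence function further down in the example; encoder_model, decoder_model and reverse_target_char_index are the names the tutorial gives to the separate inference models and the reverse lookup table):

import numpy as np

states = encoder_model.predict(input_seq)          # [state_h, state_c] of the encoder
target_seq = np.zeros((1, 1, num_decoder_tokens))
target_seq[0, 0, target_token_index['\t']] = 1.    # start with the '\t' start character

decoded_sentence = ''
while True:
    output_tokens, h, c = decoder_model.predict([target_seq] + states)
    sampled_index = np.argmax(output_tokens[0, -1, :])      # most probable next character
    sampled_char = reverse_target_char_index[sampled_index]
    if sampled_char == '\n' or len(decoded_sentence) > max_decoder_seq_length:
        break                                               # end-of-sequence or too long
    decoded_sentence += sampled_char
    target_seq = np.zeros((1, 1, num_decoder_tokens))       # feed the prediction back in...
    target_seq[0, 0, sampled_index] = 1.
    states = [h, c]                                         # ...together with the updated state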

NOW, to your question - I guess you want to understand why the training data looks the way it does.

First (lines 104-112):

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

The encoder training data consists of len(input_texts) samples. The next dimension is the maximum sequence length, and the third one is the token ("character") index, covering all the characters found in the texts (num_encoder_tokens - the English "alphabet", and num_decoder_tokens - the French one, plus '\t' as the start-of-sequence character and '\n' as the end).
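
For reference, the lookup tables used in the loop are plain character-to-index dictionaries built a little earlier in the example; roughly this (my paraphrase, assuming the '\t'/'\n' markers were already added to the target texts):

input_characters = sorted(set(c for text in input_texts for c in text))
target_characters = sorted(set(c for text in target_texts for c in text))

num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max(len(text) for text in input_texts)
max_decoder_seq_length = max(len(text) for text in target_texts)

input_token_index = {char: i for i, char in enumerate(input_characters)}
target_token_index = {char: i for i, char in enumerate(target_characters)}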

So, let's illustrate it with strings first, and then show the small difference that is there.

Let's say the decoder output sequence is 'Bonjour' (I don't know French, sorry), and suppose max_decoder_seq_length == 10. Then:

decoder_input_data = 'Bonjour   '  # 3 spaces, to fill up to 10
decoder_output_data = 'onjour    ' # 4 spaces, to fill up to 10

But this is not represented as a simple string - each character is one-hot encoded: at every timestep there is a vector of length num_decoder_tokens, where 0 means "it is not this character" and 1 means "it is".

So it's more like:

decoder_input_data[0]['B'] = 1  # and decoder_input_data[0][anything_else] == 0
decoder_input_data[1]['o'] = 1  # HERE: t == 1
decoder_input_data[2]['n'] = 1
# ... 
decoder_input_data[6]['r'] = 1
decoder_input_data[7:10][' '] = 1  # the padding

And the decoder target must be shifted by 1 "to the left":

# for t == 0, the `decoder_output_data` is not touched (`if t > 0`)

# decoder_output_data[t-1]['o'] = 1  # t-1 == 0
decoder_output_data[0]['o'] = 1  # t == 1
decoder_output_data[1]['n'] = 1  # t == 2
decoder_output_data[2]['j'] = 1  # t == 3
# ...
decoder_output_data[6:10][' '] = 1  # output padding with spaces, longer by 1 than input padding

So, this is basically the "Why t-1" answer.
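
You can verify this shift with a self-contained toy version of the decoder part of the loop, using the simplified 'Bonjour' example (no '\t'/'\n' markers):

import numpy as np

text = 'Bonjour'                       # the simplified target
chars = sorted(set(text + ' '))
token_index = {c: i for i, c in enumerate(chars)}
max_len = 10

dec_input = np.zeros((max_len, len(chars)))
dec_target = np.zeros((max_len, len(chars)))
for t, ch in enumerate(text):
    dec_input[t, token_index[ch]] = 1.
    if t > 0:
        dec_target[t - 1, token_index[ch]] = 1.   # same character, one step earlier
dec_input[len(text):, token_index[' ']] = 1.      # pad the input with spaces
dec_target[len(text) - 1:, token_index[' ']] = 1. # target padding starts one step earlier

# the target at step t is exactly the input character of step t + 1
assert (dec_target[:-1] == dec_input[1:]).all()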

Now "why do we need 3 input data"?

Well, this is the idea of the seq2seq approach:

We need the decoder to learn to produce the correct next French character, given the previous one (and an initial state). That's why it learns from the shifted output sequences.

But what sequence should it produce in the first place? Well, that's what the encoder is for - it produces a single final state - everything it "remembered" from reading the input sequence. Through our training we cause this very state (2 vectors of 256 floats per sequence) to guide the decoder to produce the output sequence.
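
Putting it together, training just feeds both input arrays and fits against the shifted targets - roughly like the example does it (hyperparameters quoted from memory):

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=64,
          epochs=100,
          validation_split=0.2)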

Tomasz Gandor
  • I still didn't get why we have to shift decoder output to the left. – user_12 Nov 26 '19 at 07:29
  • The decoder needs to learn to guess the *next* character. So, if the output phrase is "ABC", it is fed (only during training): 'A', to which it must respond 'B'; 'B', to which it must respond 'C'; and 'C', to which it should respond 'end of sequence'. So the inputs in this sequence are 'ABC' and the outputs 'BC'. – Tomasz Gandor Nov 26 '19 at 09:50
  • Here I'm doing it for characters, right? If I'm building a seq2seq with words and my target sentence is "this is good", in that case I have to ignore the first word, isn't it? – user_12 Nov 26 '19 at 10:01
  • Exactly, if you work with words instead of characters, you shift left by 1 word. – Tomasz Gandor Nov 26 '19 at 10:04