
I was following the Keras Seq2Seq tutorial, and it works fine. However, this is a character-level model, and I would like to adapt it to a word-level model. The authors even include a paragraph with the required changes, but all my current attempts result in an error regarding wrong dimensions.

If you follow the character-level model, the input data has 3 dimensions: (#sequences, #max_seq_len, #num_chars), since each character is one-hot encoded. When I print the summary for the model as used in the tutorial, I get:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, None, 71)     0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 94)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 256), (None, 335872      input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 256),  359424      input_2[0][0]                    
                                                                 lstm_1[0][1]                     
                                                                 lstm_1[0][2]                     
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 94)     24158       lstm_2[0][0]                     
==================================================================================================

This compiles and trains just fine.

Now this tutorial has a section "What if I want to use a word-level model with integer sequences?", and I've tried to follow those changes. First, I encode all sequences using a word index. As such, the input and target data now have 2 dimensions: (#sequences, #max_seq_len), since I no longer one-hot encode but use Embedding layers instead.

encoder_input_data_train.shape   =>  (90000, 9)
decoder_input_data_train.shape   =>  (90000, 16)
decoder_target_data_train.shape  =>  (90000, 16)

For example, a sequence might look like this:

[ 826.  288. 2961. 3127. 1260. 2108.    0.    0.    0.]
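
I build these integer sequences roughly like this (a simplified sketch of my preprocessing; `input_texts` stands for the raw source sentences and the exact tokenizer settings are placeholders):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Hypothetical preprocessing: every word becomes an integer index and the
# sequences are zero-padded at the end to a fixed length (9 on the encoder side).
encoder_tokenizer = Tokenizer()
encoder_tokenizer.fit_on_texts(input_texts)
encoder_input_data_train = pad_sequences(
    encoder_tokenizer.texts_to_sequences(input_texts),
    maxlen=max_input_seq_len, padding='post')
num_encoder_tokens = len(encoder_tokenizer.word_index) + 1  # +1 for the padding index 0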

When I use the code listed there:

# encoder
encoder_inputs = Input(shape=(None, ))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# decoder
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
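
For completeness, the model is compiled with the same settings as the character-level tutorial (assumed here; categorical_crossentropy is the loss that later expects a one-hot target per timestep):

# Assumed compile settings, matching the character-level tutorial.
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')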

the model compiles and looks like this:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_35 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
input_36 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
embedding_32 (Embedding)        (None, None, 256)    914432      input_35[0][0]                   
__________________________________________________________________________________________________
embedding_33 (Embedding)        (None, None, 256)    914432      input_36[0][0]                   
__________________________________________________________________________________________________
lstm_32 (LSTM)                  [(None, 256), (None, 525312      embedding_32[0][0]               
__________________________________________________________________________________________________
lstm_33 (LSTM)                  (None, None, 256)    525312      embedding_33[0][0]               
                                                                 lstm_32[0][1]                    
                                                                 lstm_32[0][2]                    
__________________________________________________________________________________________________
dense_21 (Dense)                (None, None, 3572)   918004      lstm_33[0][0]                    

While compiling works, training

model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=32, epochs=1, validation_split=0.2)

fails with the following error: ValueError: Error when checking target: expected dense_21 to have 3 dimensions, but got array with shape (90000, 16), with the latter being the shape of the decoder input/target data. Why does the Dense layer get an array of the shape of the decoder input data?

Things I've tried:

  • I find it a bit strange that the decoder LSTM has return_sequences=True, since I thought I cannot feed sequences to a Dense layer (and the decoder of the original character-level model does not set this). However, simply removing it or setting return_sequences=False did not help. Of course, the Dense layer then has an output shape of (None, 3572).
  • I don't quite get the need for the Input layers. I've set them to shape=(max_input_seq_len, ) and shape=(max_target_seq_len, ) respectively, so that the summary doesn't show (None, None) but the respective values, e.g., (None, 16). No change.
  • In the Keras docs I've read that an Embedding layer should be used with input_length, otherwise a Dense layer upstream cannot compute its outputs. But again, I still get errors when I set input_length accordingly.

I'm a bit at a dead end right now. Am I even on the right track, or am I missing something more fundamental? Is the shape of my data wrong? Why does the last Dense layer get an array with shape (90000, 16)? That seems rather off.

UPDATE: I figured out that the problem seems to be decoder_target_data, which currently has the shape (#samples, max_seq_len), e.g., (90000, 16). I assume I need to one-hot encode the target output with respect to the vocabulary: (#samples, max_seq_len, vocab_size), e.g., (90000, 16, 3572).
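
Concretely, the one-hot encoding I have in mind looks roughly like this (a sketch; whether index 0 is padding and whether indices need shifting depends on how the word index was built):

import numpy as np

# One-hot encode the integer targets: (90000, 16) -> (90000, 16, 3572).
# 90000 * 16 * 3572 float32 values is roughly 20 GB, which explains the
# memory error below.
decoder_target_onehot = np.zeros(
    (len(decoder_target_data_train), max_target_seq_len, num_decoder_tokens),
    dtype='float32')
for i, seq in enumerate(decoder_target_data_train):
    for t, word_index in enumerate(seq):
        if word_index > 0:  # assumed: index 0 is padding
            decoder_target_onehot[i, t, int(word_index)] = 1.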

Unfortunately, this throws a MemoryError. However, when for debugging purposes I assume a vocabulary size of 10:

decoder_target_data = np.zeros((len(input_sequences), max_target_seq_len, 10), dtype='float32')

and later in the decoder model:

x = Dense(10, activation='softmax')(x)

then the model trains without error. In case that's indeed my issue, I would have to train the model with manually generated batches so I can keep the vocabulary size but reduce #samples, e.g., to 90 batches each of shape (1000, 16, 3572). Am I on the right track here?

Christian
  • Yes, the array passed to the dense layer is wrong. It should be `(batchsize, num_encoder_tokens, latent_dim)`. Something is wrong with the embedding perhaps. Can you try `encoder_inputs = Input(shape=(num_encoder_tokens, ))` and `x = Embedding(vocab_size, latent_dim)(encoder_inputs)`? – Littleone Feb 11 '18 at 13:02
  • `num_decoder_tokens` is my `vocab_size`, which is 3,572 in my case. According to the original tutorial, `num_decoder_tokens` is the number of unique output tokens. Regarding the input shape, I've tried `encoder_inputs = Input(shape=(max_input_seq_len, ))` since I assume the value reflects the length of the sequence. I just don't understand why the dense layer (or any layer for that matter) would ever see an array of (90000, 16), which is the shape of `decoder_input_data`. The dense layer is clearly connected to an LSTM layer whose output has 3 dims. – Christian Feb 12 '18 at 11:09
  • Oh sorry, totally missed that you don't have a [time distributed dense](https://keras.io/layers/wrappers/) layer. Don't one-hot encode. Use embeddings and turn your dense layer into a time-distributed dense. That means each timestep has its own dense. Otherwise, you'd be getting one single output token, which apparently Keras doesn't even allow. I have a feeling there's something else missing too... have a go. – Littleone Feb 12 '18 at 15:23
  • `x = TimeDistributed(Dense(num_decoder_tokens, activation='softmax'))(x)` had no effect. The decoder uses an embedding layer for `decoder_input_data`, but how would this work for `decoder_target_data`? I've checked the Pytorch Seq2Seq tutorial. At `loss += self.criterion(decoder_output, target_variable[di])`, when I check the dimensions, `decoder_output` is (1, #vocab_size) and `target_variable[di]` is of size 1. I assume that Keras needs `target_variable[di]` to be of shape (1, #vocab_size) explicitly; Pytorch does it more cleverly. Hence, I think I need to one-hot encode decoder_target_data. – Christian Feb 13 '18 at 02:30
  • The Keras tutorial weirdly doesn't talk about some important tools for working with RNNs; I assumed it used/mentioned the `sparse_categorical_crossentropy` loss. With it you can skip one-hot encoding the targets. – Littleone Feb 13 '18 at 03:38
  • Did you ever figure it out @Christian? @Littleone how do you do it without one-hot encoding? – user2258651 Apr 17 '18 at 04:06
  • @user2258651 No, I haven't. I usually work with Pytorch (which handles this more neatly), so I didn't spend too much time on it. What I ended up doing was splitting the training data into chunks so that each chunk fit into memory, and of course looping over all chunks. – Christian Apr 17 '18 at 05:34
  • It seems the way to get the model to compile and train is to use sparse_categorical_crossentropy as the loss with a TimeDistributed(Dense) layer and to do np.expand_dims(data, -1) on the decoder target data (see the sketch after this comment thread). However, I now find myself with an issue where loss = nan. – user2258651 Apr 17 '18 at 15:26
  • Was there any update on this? Did one-hot encoding and time-shifting the `decoder_target_data` work? It would be awesome if the Keras team could do a complete working example at https://github.com/keras-team/keras/blob/master/examples/ – Nic Cottrell Aug 17 '18 at 06:17
  • Any update on this, @user2258651, @Christian? I have the same issue and asked a question one week ago but did not get any answer yet! – sariii Jul 04 '19 at 02:44
  • Any update on this? I have a similar error. What have you done to fix it? – user_12 Dec 05 '19 at 10:59
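
For reference, the sparse_categorical_crossentropy route described in the comments would look roughly like this (a sketch only, not verified; it keeps integer targets and avoids the one-hot memory blow-up, but the loss = nan issue mentioned above is not addressed here):

import numpy as np
from keras.layers import Dense, Embedding, Input, LSTM, TimeDistributed
from keras.models import Model

# encoder, as in the question
encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# decoder with a time-distributed softmax over the output vocabulary
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = TimeDistributed(Dense(num_decoder_tokens, activation='softmax'))(x)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Sparse loss compares the per-timestep softmax against integer labels,
# so the targets stay as word indices and never need to be one-hot encoded.
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Targets keep their integer encoding; only a trailing axis of size 1 is added:
# (90000, 16) -> (90000, 16, 1)
decoder_target_sparse = np.expand_dims(decoder_target_data_train, -1)

model.fit([encoder_input_data_train, decoder_input_data_train],
          decoder_target_sparse,
          batch_size=32, epochs=1, validation_split=0.2)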

1 Answer


Recently I was also facing this problem. There is no other solution than creating small batches, say batch_size=64, in a generator, and then using model.fit_generator instead of model.fit. I have attached my generate_batch code below:

import numpy as np

def generate_batch(X, y, batch_size=64):
    '''Generate one batch of ([encoder_input, decoder_input], decoder_target) data.'''
    while True:
        for j in range(0, len(X), batch_size):
            # Inputs stay integer-encoded; only the targets are one-hot encoded,
            # and only one batch at a time, which keeps memory usage small.
            encoder_input_data = np.zeros((batch_size, max_encoder_seq_length), dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_decoder_seq_length+2), dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_decoder_seq_length+2, num_decoder_tokens), dtype='float32')

            for i, (input_text_seq, target_text_seq) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word_index in enumerate(input_text_seq):
                    encoder_input_data[i, t] = word_index  # encoder input sequence

                for t, word_index in enumerate(target_text_seq):
                    decoder_input_data[i, t] = word_index  # decoder input sequence
                    if t > 0 and word_index <= num_decoder_tokens:
                        # The target is the decoder input shifted by one timestep,
                        # one-hot encoded over the output vocabulary.
                        decoder_target_data[i, t-1, word_index-1] = 1.

            yield([encoder_input_data, decoder_input_data], decoder_target_data)

And then training like this:

import math

batch_size = 64
epochs = 2

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit_generator(
    generator=generate_batch(X=X_train_sequences, y=y_train_sequences, batch_size=batch_size),
    steps_per_epoch=math.ceil(len(X_train_sequences)/batch_size),
    epochs=epochs,
    verbose=1,
    validation_data=generate_batch(X=X_val_sequences, y=y_val_sequences, batch_size=batch_size),
    validation_steps=math.ceil(len(X_val_sequences)/batch_size),
    workers=1,
    )

X_train_sequences is a list of lists like [[23, 34, 56], [2, 33544, 6, 10]].
The other sequence variables are structured similarly.

I also took help from this blog: word-level-english-to-marathi-nmt

Abhilash Awasthi