
I'm building a Seq2seq neural network: a video-to-natural-language model.

The problem:

My training loss decreases normally, but my testing loss increases.

Also, my training accuracy increases, while my testing accuracy stays extremely low and even decreases.

Epoch 1 ; Batch loss: 5.397328 ; Test loss: 5.954748 ; Batch accuracy: 51.22% ; Test accuracy: 00.86%
Epoch 2 ; Batch loss: 4.707819 ; Test loss: 6.127879 ; Batch accuracy: 52.87% ; Test accuracy: 00.86%
Epoch 3 ; Batch loss: 4.348535 ; Test loss: 6.274649 ; Batch accuracy: 54.26% ; Test accuracy: 00.86%
...
Epoch 27 ; Batch loss: 1.005701 ; Test loss: 13.232792 ; Batch accuracy: 74.86% ; Test accuracy: 00.54%
Epoch 28 ; Batch loss: 0.919244 ; Test loss: 13.706192 ; Batch accuracy: 75.82% ; Test accuracy: 00.54%
Epoch 29 ; Batch loss: 0.861092 ; Test loss: 12.981027 ; Batch accuracy: 76.34% ; Test accuracy: 00.20%
Epoch 30 ; Batch loss: 0.820653 ; Test loss: 13.580329 ; Batch accuracy: 76.79% ; Test accuracy: 00.51%
...
Epoch 265 ; Batch loss: 0.000597 ; Test loss: 24.191395 ; Batch accuracy: 84.18% ; Test accuracy: 00.27%

My code:

I've got two decoders that share the same weights.

  • One is used for training, with a TrainingHelper
  • The second is used for inference, with a GreedyEmbeddingHelper

import tensorflow as tf  # TF 1.x (tf.contrib.seq2seq API)

def decoder(target, encoder_last_state, encoder_outputs, caption_lengths):
  # Size of the batch
  n_data = tf.shape(target)[0]

  with tf.name_scope("decoder"):

    # Embed the tokenized target captions (teacher-forcing inputs for the training decoder)
    decoder_inputs = embeddings(target) 

    decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(dec_units)    
    decoder_initial_state = encoder_last_state

    # Final dense (projection) layer, shared by both decoders
    output_layer = tf.layers.Dense(vocab_size, kernel_initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.1), name="decoder_dense")

    # Training decoder
    with tf.variable_scope("decoder"):
      training_helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, caption_lengths)
      training_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, training_helper, decoder_initial_state, output_layer)
      training_decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder, maximum_iterations=max_length)

    # Inference decoder
    with tf.variable_scope("decoder", reuse=True):
      start_tokens = tf.tile(tf.constant([lang_tokenizer.word_index['<start>']], dtype=tf.int32), [n_data], name='start_tokens')

      inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings, start_tokens, lang_tokenizer.word_index['<end>'])
      inference_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, inference_helper, decoder_initial_state, output_layer)
      inference_decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder, maximum_iterations=max_length)

    return training_decoder_outputs, inference_decoder_outputs
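
To double-check that the two decoders really share their weights, a quick sanity check I can run after building the graph is to list the trainable variables under the decoder scope (sketch only; target, encoder_last_state, encoder_outputs and caption_lengths come from the rest of the graph, which is not shown here):

training_decoder_outputs, inference_decoder_outputs = decoder(target, encoder_last_state, encoder_outputs, caption_lengths)

# Expect a single set of LSTM and dense kernel/bias variables, with no duplicate
# copies for the inference branch (reuse=True should map it onto the same variables)
decoder_variables = [v.name for v in tf.trainable_variables() if 'decoder' in v.name]
print(decoder_variables)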

Note that:

  • the target argument is the tokenized caption; it has been preprocessed as explained in the Clue section at the end of the post (a minimal sketch is given after this list)
  • caption_lengths is a tensor containing the effective length of each caption (also explained in the Clue section)
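
For reference, here is a minimal sketch of the kind of preprocessing described in the Clue section (illustrative only: raw_captions and the max_length value below are placeholders, not my exact code):

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 16  # placeholder value for this sketch

raw_captions = [
    '<start> a female providing a show and tell of recent clothing purchases <end>',
    '<start> a woman is applying make up <end>',
]

# filters='' keeps the '<start>' / '<end>' tokens intact; index 0 stays free for padding
lang_tokenizer = Tokenizer(filters='', oov_token='<unk>')
lang_tokenizer.fit_on_texts(raw_captions)

sequences = lang_tokenizer.texts_to_sequences(raw_captions)
caption_lengths = np.array([len(s) for s in sequences])               # effective length of each caption
target = pad_sequences(sequences, maxlen=max_length, padding='post')  # padded token ids, 0 = <pad>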

The loss computation is common to both decoders' outputs.

Loss function:

def eval_loss(logits, captions, caption_lengths):
  # Pad the logits along the time dimension so that they match max_length
  paddings = [[0, 0], [0, max_length-tf.shape(logits)[1]], [0, 0]]
  padded_logits = tf.pad(logits, paddings, 'CONSTANT', constant_values=0)

  # Create the weights for sequence_loss
  masks = tf.sequence_mask(caption_lengths, max_length, dtype=tf.float32, name='masks')
  loss = tf.contrib.seq2seq.sequence_loss(logits=padded_logits, targets=captions, weights=masks)

  return loss
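
For completeness, this is roughly how the loss is hooked up to the two decoder outputs, together with the kind of masked token-level accuracy behind the percentages in the logs above (sketch: the captions placeholder and the eval_accuracy helper are simplifications, not necessarily my real code):

training_logits = training_decoder_outputs.rnn_output
inference_logits = inference_decoder_outputs.rnn_output

train_loss = eval_loss(training_logits, captions, caption_lengths)
test_loss = eval_loss(inference_logits, captions, caption_lengths)

# Masked token-level accuracy, computed the same way for both decoders
# (assumes captions is an int32 tensor of shape [batch, max_length])
def eval_accuracy(logits, captions, caption_lengths):
  paddings = [[0, 0], [0, max_length - tf.shape(logits)[1]], [0, 0]]
  padded_logits = tf.pad(logits, paddings, 'CONSTANT', constant_values=0)
  predictions = tf.argmax(padded_logits, axis=-1, output_type=tf.int32)
  masks = tf.sequence_mask(caption_lengths, max_length, dtype=tf.float32)
  correct = tf.cast(tf.equal(predictions, captions), tf.float32) * masks
  return tf.reduce_sum(correct) / tf.reduce_sum(masks)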

Prediction example:

After 265 epochs, with a test loss equal to 24.191395, here is an example of the predictions my inference decoder makes:

Video overview:

[image: video33 overview]

Ground-truth caption:

a female providing a show and tell of recent clothing purchases <end> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

Predicted caption:

a women with black top is applying make up <end> <pad> <end> <pad> <end> <pad> <end> <pad> 

Conclusion:

Is my model wrong, or am I simply overfitting?

I mean, each video is composed of several hundred frames, and each frame is preprocessed by a CNN to output 2048 features...

I've got 10,000 videos in total but only use 512 for the moment.

  • Without looking at the entire model, it would be hard to comment on it. Can you please share a GitHub gist of your code? Also, try padding the sequences at the start. Also, how similar is your test set to your training set? I suspect the model is seeing something it has not seen before, and maybe that is the concern. Cannot really comment on it before looking at the entire model. –  Aug 19 '19 at 14:55

0 Answers