I'm building a Seq2seq neural network. It's a video-to-natural-language model.
The problem:
My training loss decreases normally, but my testing loss increases.
Also, my training accuracy increases while my testing accuracy stays extremely low and even decreases.
Epoch 1 ; Batch loss: 5.397328 ; Test loss: 5.954748 ; Batch accuracy: 51.22% ; Test accuracy: 00.86%
Epoch 2 ; Batch loss: 4.707819 ; Test loss: 6.127879 ; Batch accuracy: 52.87% ; Test accuracy: 00.86%
Epoch 3 ; Batch loss: 4.348535 ; Test loss: 6.274649 ; Batch accuracy: 54.26% ; Test accuracy: 00.86%
...
Epoch 27 ; Batch loss: 1.005701 ; Test loss: 13.232792 ; Batch accuracy: 74.86% ; Test accuracy: 00.54%
Epoch 28 ; Batch loss: 0.919244 ; Test loss: 13.706192 ; Batch accuracy: 75.82% ; Test accuracy: 00.54%
Epoch 29 ; Batch loss: 0.861092 ; Test loss: 12.981027 ; Batch accuracy: 76.34% ; Test accuracy: 00.20%
Epoch 30 ; Batch loss: 0.820653 ; Test loss: 13.580329 ; Batch accuracy: 76.79% ; Test accuracy: 00.51%
...
Epoch 265 ; Batch loss: 0.000597 ; Test loss: 24.191395 ; Batch accuracy: 84.18% ; Test accuracy: 00.27%
My code:
I've got two decoders that share the same weights.
- One is used for training, with a TrainingHelper
- The second is used for inference, with a GreedyEmbeddingHelper
def decoder(target, encoder_last_state, encoder_outputs, caption_lengths):
    # Size of the batch
    n_data = tf.shape(target)[0]

    with tf.name_scope("decoder"):
        # Performing embedding...
        decoder_inputs = embeddings(target)

        decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(dec_units)
        decoder_initial_state = encoder_last_state

        # Final dense layer. It should be common to both decoders
        output_layer = tf.layers.Dense(vocab_size,
                                       kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1),
                                       name="decoder_dense")

        # Training decoder
        with tf.variable_scope("decoder"):
            training_helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, caption_lengths)
            training_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, training_helper,
                                                               decoder_initial_state, output_layer)
            training_decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                                               maximum_iterations=max_length)

        # Inference decoder
        with tf.variable_scope("decoder", reuse=True):
            start_tokens = tf.tile(tf.constant([lang_tokenizer.word_index['<start>']], dtype=tf.int32),
                                   [n_data], name='start_tokens')
            inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings, start_tokens,
                                                                        lang_tokenizer.word_index['<end>'])
            inference_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, inference_helper,
                                                                decoder_initial_state, output_layer)
            inference_decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                                                maximum_iterations=max_length)

    return training_decoder_outputs, inference_decoder_outputs
Note that:
- the target argument is the tokenized caption. It has been preprocessed as explained in the Clue section at the end of the post.
- caption_lengths is a tensor that contains the effective length of each caption (also explained in the Clue section); a small illustrative sketch follows.
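To make the shapes concrete, here is a minimal, purely illustrative sketch of what target and caption_lengths look like (this is not my actual preprocessing, which is in the Clue section; the <pad> id is assumed to be 0 here):

import tensorflow as tf

# Toy example: 2 captions, tokenized to integer ids and padded to length 6.
# Row layout: <start> ... <end> followed by <pad> (assumed id 0).
target = tf.constant([[2, 5, 9, 3, 0, 0],
                      [2, 7, 3, 0, 0, 0]], dtype=tf.int32)

# Effective length of each caption = number of non-<pad> tokens.
pad_id = 0
caption_lengths = tf.reduce_sum(tf.cast(tf.not_equal(target, pad_id), tf.int32), axis=1)
# caption_lengths evaluates to [4, 3]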
The loss computation is shared by the outputs of the two decoders.
Loss function:
def eval_loss(logits, captions, caption_lengths):
    # Adjusting the size of logits...
    paddings = [[0, 0], [0, max_length - tf.shape(logits)[1]], [0, 0]]
    padded_logits = tf.pad(logits, paddings, 'CONSTANT', constant_values=0)

    # Create the weights for sequence_loss
    masks = tf.sequence_mask(caption_lengths, max_length, dtype=tf.float32, name='masks')

    loss = tf.contrib.seq2seq.sequence_loss(logits=padded_logits, targets=captions, weights=masks)
    return loss
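For context, this is roughly how I plug the same loss function into both decoder outputs (the Adam optimizer and learning_rate below are just illustrative):

training_decoder_outputs, inference_decoder_outputs = decoder(target, encoder_last_state,
                                                              encoder_outputs, caption_lengths)

# .rnn_output holds the logits of a BasicDecoder output
train_loss = eval_loss(training_decoder_outputs.rnn_output, target, caption_lengths)
test_loss = eval_loss(inference_decoder_outputs.rnn_output, target, caption_lengths)

train_op = tf.train.AdamOptimizer(learning_rate).minimize(train_loss)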
Prediction example:
After 265 epochs, with a test loss of 24.191395, here are some predictions my testing decoder makes:
Video overview:
Ground-truth caption:
a female providing a show and tell of recent clothing purchases <end> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
Predicted caption:
a women with black top is applying make up <end> <pad> <end> <pad> <end> <pad> <end> <pad>
Conclusion:
Is my model wrong, or am I simply overfitting?
I mean, each video is composed of several hundred frames, and each frame is CNN-preprocessed to output 2048 features...
I've got 10,000 videos in total but only use 512 of them for the moment.
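For reference, here is a rough sketch of how the per-frame CNN features feed the encoder (simplified; enc_units, the placeholder names and the exact shapes are illustrative, not my literal code):

enc_units = 512  # illustrative
video_features = tf.placeholder(tf.float32, [None, None, 2048], name="video_features")  # [batch, n_frames, 2048]
frame_counts = tf.placeholder(tf.int32, [None], name="frame_counts")  # effective number of frames per video

encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(enc_units)
encoder_outputs, encoder_last_state = tf.nn.dynamic_rnn(encoder_cell, video_features,
                                                        sequence_length=frame_counts, dtype=tf.float32)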