
I am training a conversational agent with an LSTM, based on TensorFlow's translation model. I train batchwise, and the training-data perplexity drops noticeably at the start of each epoch. This drop is explained by the way I read data into batches: I guarantee that every training pair in my training data is processed exactly once per epoch. When a new epoch starts, the improvements the model made during the previous epochs pay off as it encounters the training data again, which appears as a drop in the graph. Other batchwise approaches, such as the one used in TensorFlow's translation model, do not show this behavior, because they load the entire training data into memory and pick samples from it at random.
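To make the difference between the two batching strategies concrete, here is a minimal sketch (not the actual `translate.py` code; `training_pairs` and the function names are hypothetical):

```python
import random

# Toy training data: (input, response) pairs.
training_pairs = [("hi", "hello"), ("bye", "goodbye"),
                  ("thanks", "welcome"), ("yes", "no")]
BATCH_SIZE = 2

def epoch_batches(pairs, batch_size):
    """My approach: shuffle once, then yield batches so that every
    pair is processed exactly once per epoch."""
    order = list(pairs)
    random.shuffle(order)
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]

def sampled_batch(pairs, batch_size):
    """translate.py-style approach: draw a random batch from the
    full in-memory data at every step (pairs may repeat or be skipped)."""
    return [random.choice(pairs) for _ in range(batch_size)]

# One epoch of epoch_batches covers every pair exactly once:
seen = [p for batch in epoch_batches(training_pairs, BATCH_SIZE) for p in batch]
assert sorted(seen) == sorted(training_pairs)
```

With `epoch_batches`, the epoch boundary is a real event (the model revisits data it has improved on since last time), which is why the perplexity drop lines up with epoch starts; with `sampled_batch` there is no such boundary.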

[Graph: training perplexity vs. training step]

| Step   | Perplexity |
|--------|------------|
| 330000 | 19.36      |
| 340000 | 19.20      |
| 350000 | 17.79      |
| 360000 | 17.79      |
| 370000 | 17.93      |
| 380000 | 17.98      |
| 390000 | 18.05      |
| 400000 | 18.10      |
| 410000 | 18.14      |
| 420000 | 18.07      |
| 430000 | 16.48      |
| 440000 | 16.75      |

(A small snippet of the perplexity log showing drops at steps 350000 and 430000. Between the drops, the perplexity rises slightly.)
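For reference, the perplexity values above are the exponential of the average per-token cross-entropy (negative log-likelihood), so small changes in loss translate into the visible changes in perplexity. A minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A constant per-token NLL of ln(19.36) yields a perplexity of 19.36.
assert abs(perplexity([math.log(19.36)] * 5) - 19.36) < 1e-9
```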

However, my question is about the trend after each drop. From the graph it is clear that the perplexity rises slightly within each epoch (after step ~350000) until the next drop. Can someone offer an explanation or theory for why this happens?

simejo

1 Answer


That would be typical of overfitting.

tnarik