I have been training the textsum seq2seq-with-attention model for abstractive summarization on a training corpus of 600k articles + abstracts. Can this be regarded as convergence? If so, can it be right that it converged after fewer than, say, 5k steps? Considerations:
- I've trained with a vocabulary size of 200k
- 5k steps (until approximate convergence) with a batch size of 4 means that at most 20k distinct samples were seen, which is only a small fraction of the entire training corpus (see the quick arithmetic check below)
Or am I actually just reading tea leaves here, and is the marginal negative slope of the loss exactly what one should expect?
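For reference, here is the arithmetic behind the second consideration as a minimal sketch. It assumes one optimizer step consumes exactly one batch (no gradient accumulation) and that samples are drawn without replacement, so the sample count is an upper bound:

```python
# Upper bound on corpus coverage after ~5k steps.
# Assumptions: 1 step = 1 batch, no repeated samples within the window.

corpus_size = 600_000   # articles + abstracts in the training set
batch_size = 4
steps = 5_000

samples_seen = steps * batch_size             # at most 20,000 samples
fraction_of_epoch = samples_seen / corpus_size

print(f"samples seen: {samples_seen:,}")                   # 20,000
print(f"fraction of one epoch: {fraction_of_epoch:.1%}")   # ~3.3%
```

So by 5k steps the model has seen at most ~3.3% of the corpus, i.e. nowhere near a full epoch, which is why apparent convergence that early seems suspicious to me.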