What training method would you recommend for training an attention-based sequence-to-sequence neural machine translation model? SGD, Adadelta, Adam, or something better? Please give some advice, thanks.
1 Answer
Use an adaptive gradient algorithm like Adam, Adadelta, or RMSProp. I tend to use Adam, and always in combination with gradient clipping.
Adaptive gradient algorithms maintain a separate learning rate for each parameter. This is very helpful when some parameters receive sparse gradient updates (their effective learning rate is raised) while others are updated on every step (their effective learning rate is lowered). In neural machine translation this sparsity is a real issue: the embeddings of rare words, for example, only get gradients on the few batches where those words appear. Adam is a bit more computationally expensive, but in my experience it gives good results.
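To make this concrete, here is a minimal sketch in PyTorch (my assumption, since no framework was named) of Adam combined with gradient clipping. The `nn.Linear` and random tensors are placeholders standing in for a real attention-based seq2seq model and data:

```python
import torch
import torch.nn as nn

# Placeholder: swap in your actual encoder-decoder NMT model here.
model = nn.Linear(512, 512)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
max_grad_norm = 1.0  # clipping threshold; values around 1.0-5.0 are common

for step in range(100):
    src = torch.randn(32, 512)  # dummy source batch
    tgt = torch.randn(32, 512)  # dummy target batch
    loss = nn.functional.mse_loss(model(src), tgt)

    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients so their global norm does not exceed max_grad_norm;
    # this guards against the exploding gradients that recurrent
    # seq2seq models are prone to.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
```

The key detail is that `clip_grad_norm_` is called after `backward()` and before `optimizer.step()`, so Adam's update is computed from the clipped gradients.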

AlexDelPiero