What training method would you recommend for training an attention-based sequence-to-sequence neural machine translation model? SGD, Adadelta, Adam, or something better? Please give some advice, thanks.
1 Answer
Use an adaptive gradient algorithm like Adam, Adadelta, or RMSProp. I tend to use Adam, and always in combination with gradient clipping.
Adaptive gradient algorithms maintain a separate learning rate for each parameter. This is very helpful when some parameters receive sparse gradient updates (their effective learning rate is raised) while others are updated on every step (their effective learning rate is lowered). In neural machine translation this sparsity is a real issue: the embeddings of rare words, for example, only get gradients on the few batches where those words appear. Adam is a bit more computationally expensive, but in my experience it gives good results.
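To make this concrete, here is a minimal sketch in PyTorch (my assumption, since no framework was named) of Adam combined with gradient clipping. The `nn.Linear` and random tensors are placeholders standing in for a real attention-based seq2seq model and data:

```python
import torch
import torch.nn as nn

# Placeholder: swap in your actual encoder-decoder NMT model here.
model = nn.Linear(512, 512)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
max_grad_norm = 1.0  # clipping threshold; values around 1.0-5.0 are common

for step in range(100):
    src = torch.randn(32, 512)  # dummy source batch
    tgt = torch.randn(32, 512)  # dummy target batch
    loss = nn.functional.mse_loss(model(src), tgt)

    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients so their global norm does not exceed max_grad_norm;
    # this guards against the exploding gradients that recurrent
    # seq2seq models are prone to.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
```

The key detail is that `clip_grad_norm_` is called after `backward()` and before `optimizer.step()`, so Adam's update is computed from the clipped gradients.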

AlexDelPiero