I've read this article and it seems like, given enough memory, you should always use Adam over the other optimization algorithms (AdaDelta, RMSProp, vanilla SGD, etc.). Are there any examples, either toy or real-world, in which Adam will do significantly worse than another algorithm? I imagine that for a mostly convex loss function over mostly dense inputs you will probably get faster convergence with vanilla SGD, but then you still have to tune the learning-rate schedule, which takes time.
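To make the question concrete, here is a minimal sketch of the kind of toy comparison I mean (plain NumPy, full-batch gradients for simplicity; the learning rates are illustrative defaults, not tuned):

```python
import numpy as np

# Toy convex problem: least squares on dense inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true

def loss_and_grad(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def run_sgd(lr=0.1, steps=500):
    # "Vanilla" gradient descent (full batch here, for simplicity).
    w = np.zeros(10)
    for _ in range(steps):
        _, g = loss_and_grad(w)
        w -= lr * g
    return loss_and_grad(w)[0]

def run_adam(lr=0.001, steps=500, b1=0.9, b2=0.999, eps=1e-8):
    # Adam with its usual default hyperparameters.
    w, m, v = np.zeros(10), np.zeros(10), np.zeros(10)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return loss_and_grad(w)[0]

print("plain GD final loss:", run_sgd())
print("Adam     final loss:", run_adam())
```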
- I've seen people prefer momentum over `AdamOptimizer` because it worked better for sparse gradients – Yaroslav Bulatov May 24 '16 at 19:03
2 Answers
I tend to use vanilla SGD as long as I am still in the process of getting the general graph layout right, since Adam and AdaGrad bring a lot of extra per-parameter state (their accumulator matrices) with them, which makes debugging harder. But once you have your model and want to train at scale, I guess Adam, AdaGrad and RMSProp are the choices. My personal experience is that, working on seq2seq tasks, AdaGrad is very efficient and stable.
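To put a rough number on that overhead, here is a framework-agnostic sketch (just NumPy buffers; the shape is illustrative): plain SGD keeps no extra buffers, AdaGrad keeps one accumulator per parameter, and Adam keeps two moment estimates per parameter.

```python
import numpy as np

# Extra optimizer state per weight matrix, relative to the
# parameters themselves. The shape is illustrative.
W = np.zeros((1024, 1024), dtype=np.float32)

states = {
    "vanilla SGD": {},                              # no extra buffers
    "AdaGrad":     {"acc": np.zeros_like(W)},       # squared-gradient accumulator
    "Adam":        {"m": np.zeros_like(W),          # first moment
                    "v": np.zeros_like(W)},         # second moment
}

for name, bufs in states.items():
    extra = sum(b.nbytes for b in bufs.values())
    print(f"{name}: {extra / W.nbytes:.0f}x parameter memory in optimizer state")
```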

Phillip Bock
There is no universally optimal optimization method: averaged over all possible objective functions, every optimizer performs the same, so gains on one class of problems are paid for elsewhere. See the No Free Lunch theorem.
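For reference, the result usually meant here is Wolpert and Macready (1997), "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation 1(1):67-82. A rough paraphrase of the core statement (my wording, not the answerer's): for any two search algorithms $a_1$ and $a_2$, any number of evaluations $m$, and any observed sequence of cost values $d_m^y$,

```latex
% Informal statement of the NFL theorem for optimization
% (Wolpert & Macready, 1997); the sum runs over all objective
% functions f on a finite search space.
\sum_{f} P\left(d_m^{y} \mid f, m, a_1\right)
  = \sum_{f} P\left(d_m^{y} \mid f, m, a_2\right)
```

i.e. summed over all possible objectives, the two algorithms are indistinguishable.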

Alexander Hamilton
- Can you explain your reasoning, give a bit more context, and if possible give a link to the theorem you mention? – mjuarez Feb 01 '18 at 23:25