I've read this article and it seems like, given enough memory, you should always use Adam over the other optimization algorithms (AdaDelta, RMSProp, vanilla SGD, etc.). Are there any examples, either toy or real-world, in which Adam will do significantly worse than another algorithm? I imagine that for a mostly convex loss function over mostly dense inputs you will probably get faster convergence with vanilla SGD, but then you still have to tune the learning-rate schedule, which takes time.
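To make the question concrete, here is a minimal sketch of the kind of toy comparison I mean (plain NumPy, full-batch gradients for simplicity; the learning rates are illustrative defaults, not tuned):

```python
import numpy as np

# Toy convex problem: least squares on dense inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true

def loss_and_grad(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def run_sgd(lr=0.1, steps=500):
    # "Vanilla" gradient descent (full batch here, for simplicity).
    w = np.zeros(10)
    for _ in range(steps):
        _, g = loss_and_grad(w)
        w -= lr * g
    return loss_and_grad(w)[0]

def run_adam(lr=0.001, steps=500, b1=0.9, b2=0.999, eps=1e-8):
    # Adam with its usual default hyperparameters.
    w, m, v = np.zeros(10), np.zeros(10), np.zeros(10)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return loss_and_grad(w)[0]

print("plain GD final loss:", run_sgd())
print("Adam     final loss:", run_adam())
```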
- I've seen people prefer momentum over `AdamOptimizer` because it worked better for sparse gradients – Yaroslav Bulatov May 24 '16 at 19:03
2 Answers
I tend to use vanilla SGD as long as I am still in the process of getting the general graph layout right, since Adam and AdaGrad bring a lot of extra per-parameter state (their accumulator matrices) with them, which makes debugging harder. But once you have your model and want to train at scale, I guess Adam, AdaGrad and RMSProp are the choices. My personal experience is that, working on seq2seq tasks, AdaGrad is very efficient and stable.
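To put a rough number on that overhead, here is a framework-agnostic sketch (just NumPy buffers; the shape is illustrative): plain SGD keeps no extra buffers, AdaGrad keeps one accumulator per parameter, and Adam keeps two moment estimates per parameter.

```python
import numpy as np

# Extra optimizer state per weight matrix, relative to the
# parameters themselves. The shape is illustrative.
W = np.zeros((1024, 1024), dtype=np.float32)

states = {
    "vanilla SGD": {},                              # no extra buffers
    "AdaGrad":     {"acc": np.zeros_like(W)},       # squared-gradient accumulator
    "Adam":        {"m": np.zeros_like(W),          # first moment
                    "v": np.zeros_like(W)},         # second moment
}

for name, bufs in states.items():
    extra = sum(b.nbytes for b in bufs.values())
    print(f"{name}: {extra / W.nbytes:.0f}x parameter memory in optimizer state")
```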

Phillip Bock
There is no universally optimal optimization method: averaged over all possible objective functions, every optimizer performs the same, so gains on one class of problems are paid for elsewhere. See the No Free Lunch theorem.
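For reference, the result usually meant here is Wolpert and Macready (1997), "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation 1(1):67-82. A rough paraphrase of the core statement (my wording, not the answerer's): for any two search algorithms $a_1$ and $a_2$, any number of evaluations $m$, and any observed sequence of cost values $d_m^y$,

```latex
% Informal statement of the NFL theorem for optimization
% (Wolpert & Macready, 1997); the sum runs over all objective
% functions f on a finite search space.
\sum_{f} P\left(d_m^{y} \mid f, m, a_1\right)
  = \sum_{f} P\left(d_m^{y} \mid f, m, a_2\right)
```

i.e. summed over all possible objectives, the two algorithms are indistinguishable.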

Alexander Hamilton
- Can you explain your reasoning, give a bit more context, and if possible give a link to the theorem you mention? – mjuarez Feb 01 '18 at 23:25