
Gradient descent, RMSprop, and Adam are optimizers. Assume I have chosen the Adam or RMSprop optimizer while compiling the model, i.e. `model.compile(optimizer="adam")`.

My doubt is: during backpropagation, is gradient descent used for updating the weights, or is Adam used for updating the weights?

pjrockzzz
  • Adam, RMSprop etc. are all variants/extensions/improvements of the basic ("vanilla") stochastic gradient descent (SGD) algorithm (optimizer). With `optimizer="adam"`, Adam (again, an SGD variant) will be used for the weight updates; with `optimizer="sgd"`, the vanilla SGD will be used. I kindly suggest you have a look at the relevant concepts - see for example [An overview of gradient descent optimization algorithms](https://ruder.io/optimizing-gradient-descent/index.html). – desertnaut Feb 19 '21 at 16:03

1 Answer


Backpropagation computes the gradients; the optimizer (vanilla gradient descent or one of its variants, such as Adam) then uses those gradients to update the weights. There are plenty of optimizers, like the ones you mention and many more.
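
To make that split explicit, here is a minimal TensorFlow sketch (the scalar weight and quadratic loss are made up purely for illustration): the tape's backward pass computes the gradient, and the optimizer object, here Adam, applies its own update rule.

```python
import tensorflow as tf

# Toy setup: a single weight and a made-up quadratic loss.
w = tf.Variable(2.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

with tf.GradientTape() as tape:
    loss = (w - 5.0) ** 2                      # toy loss

grads = tape.gradient(loss, [w])               # backpropagation: compute the gradient
optimizer.apply_gradients(zip(grads, [w]))     # optimizer: apply Adam's update rule to w
```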

Optimizers like RMSprop and Adam use an adaptive, per-parameter learning rate. This gives more degrees of freedom: the effective step size can increase along one direction of the loss surface and decrease along another, so the optimizer does not get stuck oscillating in one direction and can make faster progress along the others.

RMSprop applies a momentum-like exponential decay to the gradient history, so gradients from the distant past have less influence. It modifies the AdaGrad optimizer to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
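
A minimal NumPy sketch of that update (the function name and default hyperparameters are only illustrative, not Keras's internal code):

```python
import numpy as np

def rmsprop_step(w, g, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update for weights w with gradient g."""
    cache = rho * cache + (1 - rho) * g**2      # exponentially weighted average of squared gradients
    w = w - lr * g / (np.sqrt(cache) + eps)     # per-parameter scaled step
    return w, cache
```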

Adam (adaptive moments) tracks the first and second moments of the gradient and applies a momentum-like decay to both. In addition, it uses bias correction to avoid instabilities in the moment estimates during the first steps.
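
Sketched the same way (again, an illustration of the textbook update rule, not the library implementation):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g             # 1st moment: decayed mean of gradients
    v = beta2 * v + (1 - beta2) * g**2          # 2nd moment: decayed mean of squared gradients
    m_hat = m / (1 - beta1**t)                  # bias correction for the early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```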

How to choose one?

It depends on the problem we are trying to solve. The best algorithm is the one that can traverse the loss surface of that problem well.

The choice is more empirical than mathematical.
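
In practice that usually means just trying a few of them. For example (toy model for illustration; only the `optimizer` argument changes, the backpropagation of gradients stays the same):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Swapping the string swaps the weight-update rule, nothing else.
model.compile(optimizer="adam", loss="mse")       # Adam's update rule
# model.compile(optimizer="rmsprop", loss="mse")  # RMSprop's update rule
# model.compile(optimizer="sgd", loss="mse")      # vanilla SGD update rule
```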

cosa__
  • I kindly suggest you re-read the **exact question**: with `model.compile(optimizer = "adam")`, "*is gradient Descent used for updating weights or Adam is used for updating weights?*" – desertnaut Feb 19 '21 at 14:35
  • @desertnaut Can you please clarify my doubt. – pjrockzzz Feb 19 '21 at 15:24
  • GD uses the update rule w ← w − εg, where ε is the learning rate; for GD it can be constant, exponentially decayed, etc. With this command you use Adam's update rule for w. – cosa__ Feb 19 '21 at 15:57
  • @cosa__ again, what does this have to do with the *specific question* asked by the OP? – desertnaut Feb 19 '21 at 16:04