-2

I don't know in detail how the Stochastic Gradient Descent algorithm works, and I don't need to know that at the moment. What I do know is that it minimizes the loss function by computing gradients and moving in the direction of a local minimum. But I'm using Stochastic Gradient Descent as an optimizer in my Keras project, and I don't know what the parameters of this optimizer mean. Those parameters are briefly described in the documentation, but the descriptions aren't specific enough and I still don't understand them.

So could you explain those 4 parameters:

lr: float >= 0. Learning rate.
momentum: float >= 0. Parameter that accelerates SGD in the relevant direction and dampens oscillations.
decay: float >= 0. Learning rate decay over each update.
nesterov: boolean. Whether to apply Nesterov momentum.

And how do I know what values I should set them to?
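For context, this is roughly how the optimizer gets passed to a model in Keras (the model here is just a minimal placeholder, and the parameter values are the documented defaults, not something I've chosen deliberately):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# Minimal placeholder model, just to show where the optimizer is used.
model = Sequential([Dense(1, input_dim=4)])

# The four parameters from the documentation (values are the Keras defaults).
sgd = SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)

model.compile(loss='mse', optimizer=sgd)
```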

Damian
    You are wrong about *I don't need to know this at the moment*. At least read ```BOTTOU, Léon. Stochastic gradient descent tricks. In: Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. S. 421-436.``` [link](https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/tricks-2012.pdf) – sascha Mar 18 '18 at 17:49
  • If you want to learn in a practical way about this topic I can advise you “Deep Learning” by Ian Goodfellow and Yoshua Bengio – primef Mar 18 '18 at 17:51

1 Answer

2

The learning rate is the size of the step you take towards the minimum. If you use a large learning rate, you risk overshooting the minimum; if you choose it too small, it will take a long time to reach the minimum. A good starting point for the learning rate is 0.01, and then you can increase it, e.g. to 0.03, 0.1, 0.3 and so on. The decay, on the other hand, is how much the learning rate is reduced over time. The reasoning behind it is that at the beginning of training you may want a large learning rate to quickly get *around* the minimum, and after that a smaller learning rate to converge precisely to the minimum.
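To make the decay concrete, here is a small sketch, assuming Keras's time-based rule `lr_t = lr / (1 + decay * iterations)` (which is how the legacy `keras.optimizers.SGD` applies `decay`); the decay value of `1e-3` below is just an illustrative choice:

```python
# Effective learning rate over training, assuming the time-based decay rule
# lr_t = lr / (1 + decay * iterations).
initial_lrs = [0.01, 0.03, 0.1, 0.3]  # the starting points suggested above
decay = 1e-3                          # illustrative value, not a recommendation

for lr in initial_lrs:
    for iteration in (0, 1000, 10000):
        lr_t = lr / (1.0 + decay * iteration)
        print("lr=%.2f, iteration=%5d -> effective lr = %.5f" % (lr, iteration, lr_t))
```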

I'm sorry, but I don't know much about the other two; my text was too long to post as a comment, though.

primef