
When I increase or decrease the mini-batch size used in SGD, should I change the learning rate? If so, how?

For reference, I was discussing this with someone, and they said that when the batch size is increased, the learning rate should be decreased to some extent.

My understanding is that when I increase the batch size, the computed average gradient will be less noisy, so I would either keep the same learning rate or increase it.

Also, if I use an adaptive learning-rate optimizer such as Adam or RMSProp, then I guess I can leave the learning rate untouched.

Please correct me if I am mistaken and give any insight on this.

Tanmay

3 Answers


Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance in the gradient expectation constant. See page 5 of A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks": https://arxiv.org/abs/1404.5997

However, recent experiments with large mini-batches suggest a simpler linear scaling rule, i.e. multiply your learning rate by k when using a mini-batch size of kN. See P. Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour": https://arxiv.org/abs/1706.02677
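
For concreteness, here is a minimal sketch of the two rules (the `scale_learning_rate` helper and the example numbers are illustrative, not code from either paper):

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Rescale a base learning rate when the mini-batch size changes.
    Illustrative helper, not taken from the cited papers."""
    k = new_batch_size / base_batch_size
    if rule == "linear":   # linear scaling rule (Goyal et al., 2017)
        return base_lr * k
    if rule == "sqrt":     # square-root scaling rule (Krizhevsky, 2014)
        return base_lr * k ** 0.5
    raise ValueError("rule must be 'linear' or 'sqrt'")

# e.g. going from batch size 256 at lr 0.1 to batch size 1024 (k = 4):
print(scale_learning_rate(0.1, 256, 1024, rule="linear"))  # 0.4
print(scale_learning_rate(0.1, 256, 1024, rule="sqrt"))    # 0.2
```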

I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate may remain the same if the batch size does not change substantially.
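
As a minimal sketch of the intuition (the scalar-gradient noise model and the `adam_final_step` helper below are my own illustrative assumptions, not something from this answer): Adam divides by a running estimate of the gradient's second moment, so its effective step size is bounded by the learning rate and changes only modestly as the mini-batch gradient becomes less noisy.

```python
import numpy as np

def adam_final_step(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of the last Adam update for a stream of scalar gradients
    (illustration only, not library code)."""
    m = v = step = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        step = lr * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    return abs(step)

rng = np.random.default_rng(0)
true_grad, per_sample_noise = 1.0, 5.0
for batch in [8, 64, 512]:
    # Mini-batch gradients: the true gradient plus noise that shrinks as 1/sqrt(batch).
    grads = true_grad + per_sample_noise / np.sqrt(batch) * rng.standard_normal(500)
    print(batch, adam_final_step(grads))  # step stays within a small factor of lr
```

The printed step sizes vary far less than the 64x change in batch size, which is the sense in which the learning rate for adaptive optimizers is less sensitive to batch size.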

Dmytro Prylipko
  • Where do you have the argument that learning rate may stay the same if batch size does not change substantially? I have never seen this in theory or practice. – EntropicFox Mar 09 '21 at 12:06
  • Every post about deep learning (where a data set like CIFAR or MNIST is used) should start with this phrase: "Theory suggests ..." – user3352632 May 06 '21 at 10:11

Learning Rate Scaling for Dummies

I've always found the heuristics, which seem to vary somewhere between scaling with the square root of the batch size and scaling with the batch size itself, to be a bit hand-wavy and fluffy, as is often the case in deep learning. Hence I devised my own theoretical framework to answer this question.

EDIT: Since posting this answer, my paper on this topic has been published in the Journal of Machine Learning Research (https://www.jmlr.org/papers/volume23/20-1258/20-1258.pdf). I want to thank the Stack Overflow community for believing in my ideas, engaging with me and probing me, at a time when the research community dismissed me out of hand.

Learning Rate is a function of the Largest Eigenvalue

Let me start with two small sub-questions, which together answer the main question:

  • Are there any cases where we can a priori know the optimal learning rate?

Yes: for a convex quadratic, the optimal learning rate is given by 2/(λ+μ), where λ and μ are the largest and smallest eigenvalues of the Hessian (the second derivative of the loss, ∇∇L, which is a matrix), respectively.

  • How do we expect these eigenvalues (which represent how much the loss changes along an infinitesimal move in the direction of the eigenvectors) to change as a function of batch size?

This is actually a little trickier to answer (it is what I built the theory for in the first place), but it goes something like this.

Let us imagine that we have all the data, which would give us the full Hessian H. But instead we only sub-sample this Hessian, so we use a batch Hessian B. We can simply rewrite B = H + (B - H) = H + E, where E is now some error or fluctuation matrix.

Under some technical assumptions on the nature of the elements of E, we can treat this fluctuation as a zero-mean random matrix, so the batch Hessian becomes a fixed matrix plus a random matrix.

For this model, the change in the eigenvalues (which determines how large the learning rate can be) is known. In my paper there is another, fancier model, but the answer is more or less the same.
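
Here is a minimal numerical sketch of that fixed-matrix-plus-random-matrix picture (the dimension, spectrum, and 1/sqrt(batch) noise scaling below are illustrative assumptions of mine, not the exact model or code from the paper): as the fluctuation matrix E shrinks with batch size, the largest eigenvalue of B = H + E falls towards that of H, so the usable learning rate grows, before plateauing once H's outlier dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
# Fixed "full-data" Hessian H with one dominant outlier eigenvalue (5.0) and a small bulk.
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
spectrum = np.concatenate(([5.0], rng.uniform(0.05, 1.0, d - 1)))
H = Q @ np.diag(spectrum) @ Q.T

for batch_size in [8, 32, 128, 512, 2048]:
    # Zero-mean symmetric fluctuation E whose scale shrinks roughly as 1/sqrt(batch).
    noise = rng.standard_normal((d, d))
    E = (noise + noise.T) / np.sqrt(2 * d) * (20.0 / np.sqrt(batch_size))
    B = H + E                              # batch Hessian B = H + E
    lam_max = np.linalg.eigvalsh(B)[-1]    # largest eigenvalue of the batch Hessian
    # 2/(lambda + mu) from above behaves roughly like 2/lambda_max when lambda_max >> mu.
    print(f"batch {batch_size:5d}: largest eigenvalue {lam_max:6.2f}, "
          f"max stable lr ~ {2.0 / lam_max:.3f}")
```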

What actually happens? Experiments and Scaling Rules

I attach a plot of what happens when the largest eigenvalue of the full-data matrix lies far outside the spectrum of the noise matrix (usually the case). As we increase the mini-batch size, the size of the noise matrix decreases, so the largest eigenvalue also decreases, and hence larger learning rates can be used. This effect is initially proportional and continues to be approximately proportional up to a threshold, after which no appreciable decrease happens.

[Figure: largest eigenvalue change with mini-batching]

How well does this hold in practice? The answer, as shown below in my plot for VGG-16 without batch norm (see the paper for batch normalisation and ResNets), is: very well.

[Figure: learning rate scaling with batch size for VGG-16 without batch norm]

I would hasten to add that for adaptive methods, if you use a small numerical stability constant (epsilon in Adam), the argument is a little different, because you have an interplay between the eigenvalues, the estimated eigenvalues, and your stability constant. So you actually end up with a square-root rule up to a threshold. Quite why nobody is discussing this or has published this result is honestly a little beyond me.

[Figure: square-root scaling rule for adaptive methods up to a threshold]

But if you want my practical advice: stick with SGD, scale the learning rate proportionally to the increase in batch size while the batch size is small, and then don't increase it beyond a certain point.
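
As a rough sketch of that practical recipe (the specific threshold batch size and base values below are placeholders of my own, and the square-root branch reflects the adaptive-method discussion above rather than a published rule):

```python
def scaled_lr(base_lr, base_batch, new_batch, threshold_batch=2048, optimizer="sgd"):
    """Illustrative scaling recipe: linear in batch size for SGD, square root
    for adaptive methods, with no further increase beyond a threshold batch size."""
    effective_batch = min(new_batch, threshold_batch)  # stop scaling past the threshold
    k = effective_batch / base_batch
    return base_lr * (k if optimizer == "sgd" else k ** 0.5)

# e.g. base lr 0.1 tuned at batch size 128:
print(scaled_lr(0.1, 128, 1024))                    # SGD, 8x batch  -> lr 0.8
print(scaled_lr(0.1, 128, 1024, optimizer="adam"))  # Adam, 8x batch -> lr ~0.28
print(scaled_lr(0.1, 128, 8192))                    # capped at the 2048 threshold -> lr 1.6
```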

EntropicFox
  • Impressive answer. Do the results in https://arxiv.org/abs/2103.00065 confirm or refute your model? – DeltaIV Mar 28 '21 at 16:09
  • @DeltaIV, do you mind being more specific as to which results? Short answer: no, because the "edge of stability" would change by the factors given. IMHO, I am very skeptical of any deep learning paper which makes bold claims such as "observation X is inherently non-quadratic" followed by loose experiments (such as a few CIFAR-10 experiments, primarily validated on mean square loss for classification, which nobody uses), e.g. "DNNs are inherently different to spin glass loss surfaces" (https://arxiv.org/pdf/1803.06969.pdf) (debunked in sec 8.1 of https://arxiv.org/pdf/2102.06740.pdf). – EntropicFox Mar 31 '21 at 12:09
  • @DeltaIV there is some merit to the idea that the inherent curvature is not what matters, because the choice of learning rate actually influences the curvature (via stability). However, as the results above show, once you have settled on a set of hyper-parameters which you think explore the surface well, how to scale them is well modelled by these results. – EntropicFox Mar 31 '21 at 12:11
  • "Do you mind being more specific as to which results?" I was referring to this: > "at the Edge of Stability, the behavior of gradient descent on the real neural training objective is irreconcilably different from the behavior of gradient descent on the quadratic Taylor approximation: the former makes consistent (if choppy) progress, whereas the latter would diverge (and this divergence would happen quickly, as we demonstrate in Appendix D). Thus, the behavior of gradient descent at the Edge of Stability is inherently non-quadratic". – DeltaIV Apr 03 '21 at 10:11
  • Also, it seems to me that the paper explicitly cautions against inferring the training dynamics by looking at the distribution of the Hessian eigenvalues: "Indeed, this view is arguably implicit in efforts to draw conclusions about trainability from Hessian eigenvalue spectra measured during training (Ghorbani et al., 2019)." To the best of my (limited) understanding, the results in your paper are based on comparing the SD of the fluctuation matrix to the SD of the full Hessian. So, are the experimental results in https://arxiv.org/abs/2103.00065 compatible with your model or not? – DeltaIV Apr 04 '21 at 08:40
  • Thank you @DeltaIV for taking the time to spell out your comments. I will do my best to answer your questions here. 1) If a function is differentiable (or, in the neural network case, differentiable almost everywhere), then it can be modelled within its vicinity as an n'th order polynomial (Taylor's theorem). We successfully train using gradient descent, which is first order, so within an even greater region than this we can expect a second-order approximation to be justified. It makes no sense to talk about gradient descent (which is first order) whilst saying second-order dynamics are invalid. – EntropicFox Apr 09 '21 at 09:21
  • My work very cleanly spells out what happens to the eigenvalues as you mini-batch. I further show that, given a chosen training regimen, this very accurately models how to scale as we sub-sample. Now the work of Wu et al. makes an interesting observation. Instead of considering that we have some approximately quadratic surface and optimising that optimally, what if we imagine we have a complicated multi-modal surface and we essentially perform gradient-descent-based annealing? I.e. bounce around the loss surface with a high learning rate and then drop it to find a "good" local minimum? – EntropicFox Apr 09 '21 at 09:23
  • @DeltaIV so in conclusion. There is no doubt that the prescriptions given by the optimization literature (how small the learning rate should be for optimal asymptotic performance) are essentially irrelevant in deep learning. Furthermore, I find that trying to "learn the learning rate" using curvature is not effective. However, there is absolutely no inconsistency in arguing that, given we have settled on a learning rate regimen, how we should alter it as we change the mini-batch can be derived (and is experimentally verified by me) from the change in curvature. – EntropicFox Apr 09 '21 at 09:25
  • Thank you for your time to answer my comments. In conclusion you say (correct me if I'm wrong): there are **two different problems**. One is how to choose the learning rate schedule (which is what Wu et al. are essentially concerned with). The other one is really the topic of this question, and it is: *given a certain learning rate schedule for a fixed minibatch size*, how do we change it if we change the minibatch? Perfect. I chose well in awarding you the bonus. – DeltaIV Apr 09 '21 at 10:09
  • PS I'd like to exchange a few words in chat... If you're ok with it, let me know when you would be available. I'll be busy from 11:30 to 12:30 and from 13:30 to 14:30 Oxford time, but other than that I'm pretty flexible. Other days would also work. – DeltaIV Apr 09 '21 at 10:09
  • @DeltaIV, yes, spot on, and haha, funny we were both at Oxford. It's worth saying that whilst the two problems are different, they aren't a million miles apart. For an attempt at learning the learning rate using curvature in a useful fashion, have a look at 5.2 here (https://openreview.net/pdf?id=86t2GlfzFo), but honestly I gave up as I think just trying a few schedules tends to work better. Perhaps others will fare better. You are more than welcome to contact me directly. – EntropicFox Apr 09 '21 at 19:55
  • Let me know how you would prefer to get in touch. – EntropicFox Apr 09 '21 at 20:04
  • @DeltaIV I have added my email into my public profile. Feel free to get in touch and schedule a call. – EntropicFox May 07 '21 at 17:01
  • This is well aligned with the intuition. When we increase the batch size we reduce the variance of each gradient, but only up to a certain point; once we reach an almost-zero-variance batch size, the norm of the gradient will increase linearly with batch size, so there is no need to increase the learning rate any more. – Taeyeong Jeong May 18 '23 at 16:36

Apart from the papers mentioned in Dmytro's answer, you can refer to the article by Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018), "Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio". The authors give a mathematical and empirical foundation for the idea that the ratio of learning rate to batch size influences the generalization capacity of a DNN. They show that this ratio plays a major role in the width of the minima found by SGD: the higher the ratio, the wider the minima and the better the generalization.

  • Not sure this answers the question as to how to change the learning rate with batch size? It simply states that a larger ratio of learning rate to batch size gives greater generalization. – EntropicFox Mar 10 '21 at 15:03