3

I was using a Keras CNN to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is that?

Using Batch-size 1000 (Acc = 0.97600)

Using Batch-size 10 (Acc = 0.97599)

Although the difference is very small, why is there even a difference? EDIT: I have found that the difference is only due to precision issues, and the two values are in fact equal.
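Roughly, the experiment looks like the sketch below (the exact CNN architecture, optimizer, and epoch count here are illustrative placeholders rather than my real setup):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel dimension, scale to [0, 1]
x_test = x_test[..., None] / 255.0

def build_model():
    # A simple CNN; the exact architecture is not the point here.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

for batch_size in (1000, 10):
    model = build_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=5, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"batch_size={batch_size}: test accuracy={acc:.5f}")
```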

Madhav Thakker
  • @pouyan's answer is correct in general, but in your specific case I think it's only a random fluctuation caused by how stochastic gradient descent works. – marco romelli Apr 03 '19 at 07:48
  • There are literally *dozens* of sources of inherent randomness in the whole procedure of model building (random weight initialization, shuffling, assignment of samples to batches etc); in your case, the real surprise is rather that the two values are practically equal, not that they are not. Just try to repeat the procedure for the two batch sizes, to see if you will come up with the same results above (hint: most probably you will not)... – desertnaut Apr 03 '19 at 10:43

3 Answers

6

That is because of the effect of mini-batch gradient descent during the training process. You can find a good explanation here; I quote some notes from that link below:

Batch size is a slider on the learning process.

  1. Small values give a learning process that converges quickly at the cost of noise in the training process.
  2. Large values give a learning process that converges slowly with accurate estimates of the error gradient.

Another important note from that link is:

The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller

That quote refers to the results of this paper.
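To make the trade-off above concrete, here is a small self-contained NumPy sketch (my own illustration, not from the linked post) comparing how noisy the mini-batch gradient estimate is for a small versus a large batch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(5)                        # current (untrained) weights
full_grad = 2 * X.T @ (X @ w - y) / n  # exact gradient of the MSE loss on all data

def minibatch_grad(batch_size):
    # Gradient of the same loss, estimated on a random mini-batch.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

for bs in (10, 1000):
    errors = [np.linalg.norm(minibatch_grad(bs) - full_grad) for _ in range(200)]
    print(f"batch_size={bs}: mean gradient-estimation error = {np.mean(errors):.3f}")

# Small batches give noisy (but cheap) gradient estimates;
# large batches give accurate (but expensive) ones.
```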

EDIT

I should mention two more points here:

  1. Because of the inherent randomness in machine learning algorithms, you should generally not expect them (including deep learning algorithms) to produce the same results on different runs. You can find more details here (see also the seeding sketch after this list).
  2. On the other hand, your two results are so close that they are practically equal. So in your case, based on the reported results, we can say that the batch size has no visible effect on your network's results.
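If you want to check how much of the difference comes from this randomness, a common (though not bulletproof) approach is to fix the seeds before each run; a minimal sketch, assuming a TensorFlow 2.x Keras setup:

```python
# Sketch: fixing the main sources of randomness before building/training the model.
# Note: full determinism may also require disabling some non-deterministic GPU ops.
import os, random
import numpy as np
import tensorflow as tf

def set_seeds(seed=42):
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)          # Python's built-in RNG
    np.random.seed(seed)       # NumPy (shuffling, some initializations, etc.)
    tf.random.set_seed(seed)   # TensorFlow ops and Keras weight initializers

set_seeds(42)
# ... build and fit the model here; repeat with the same seed to compare runs.
```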
pouyan
1

This is not specific to Keras. The batch size and the learning rate are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD); they strongly affect the learning dynamics and thus the accuracy, the learning speed, etc.

In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them in the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. This is a noisy estimate, which helps regularize the model, and therefore the size of the batch matters a lot. In addition, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
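To make this concrete, here is a minimal sketch (my own illustration, not tied to Keras internals) of one mini-batch SGD step for a linear model with squared loss, showing where the batch size and the learning rate enter the update:

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr):
    # Gradient of the mean squared error, averaged over the mini-batch.
    # The smaller the batch, the noisier this estimate of the full-data gradient.
    grad = 2.0 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)
    # The learning rate scales how far the weights move along the negative gradient.
    return w - lr * grad

# Toy usage with a single mini-batch of 32 samples:
rng = np.random.default_rng(0)
w = np.zeros(3)
X_batch = rng.normal(size=(32, 3))
y_batch = rng.normal(size=32)
w = sgd_step(w, X_batch, y_batch, lr=0.01)
```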

alexhg
0

I want to add two points:

1) With special treatment, it is possible to achieve similar performance with a very large batch size while speeding up the training process tremendously; see, for example, Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (a rough sketch of these treatments follows after these points).

2) Regarding your MNIST example, I really don't suggest over-interpreting these numbers, because the difference is so subtle that it could easily be caused by noise. I bet that if you try models saved at a different epoch, you will see a different result.
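For point 1), the "special treatments" in that paper boil down to a linear learning-rate scaling rule plus a gradual warm-up over the first few epochs; a rough sketch in Keras terms (the numbers below are illustrative placeholders, not the paper's exact recipe):

```python
import tensorflow as tf

base_lr = 0.1          # reference learning rate for a reference batch size
base_batch = 256
batch_size = 8192      # large mini-batch
scaled_lr = base_lr * batch_size / base_batch   # linear scaling rule
warmup_epochs = 5

def lr_schedule(epoch, lr):
    if epoch < warmup_epochs:
        # Ramp up linearly from base_lr to scaled_lr during warm-up.
        return base_lr + (scaled_lr - base_lr) * (epoch + 1) / warmup_epochs
    return scaled_lr

warmup_cb = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# model.fit(..., batch_size=batch_size, callbacks=[warmup_cb])
```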

pitfall