I am training an autoencoder on MNIST, and I noticed that increasing the batch size beyond 128 starts to take more computation time, even though the dataset size is fixed.
I am using tensorflow-gpu and a GeForce GTX 1070.
I ran a couple of tests on a fixed training set of 5000 samples (784-dimensional) for 10 epochs. The batches are consecutive batch-size chunks of the 5000 training samples, so the number of iterations per epoch effectively depends on the batch size.
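For context, the experiment is roughly equivalent to the sketch below; the autoencoder architecture and the optimizer are simplified placeholders here, not my exact code:

```python
import time
import tensorflow as tf

# Fixed subset of 5000 flattened 784-dim MNIST samples.
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_train = x_train[:5000]

def build_autoencoder():
    # Placeholder architecture; the real model differs, but the idea is the same.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(784, activation="sigmoid"),
    ])

for batch_size in [512, 256, 128, 64, 32, 16, 8]:
    model = build_autoencoder()
    model.compile(optimizer="adam", loss="mse")
    start = time.time()
    # shuffle=False keeps the batches as consecutive chunks of the 5000 samples,
    # so there are ceil(5000 / batch_size) updates per epoch.
    history = model.fit(x_train, x_train, epochs=10, batch_size=batch_size,
                        shuffle=False, verbose=0)
    print(batch_size, history.history["loss"][-1], time.time() - start)
```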
I tracked the loss on this data, the execution time, and the GPU memory usage of the Python process (from the nvidia-smi output):
5000 data points, 10 epochs:

batch size    loss       execution time    GPU memory
512           53.7472    13.787 s          4281 MiB
256           48.1941     4.973 s           695 MiB
128           42.7486     3.350 s           439 MiB
64            40.0781     4.191 s           439 MiB
32            37.7348     6.487 s           441 MiB
16            36.6291    12.102 s           441 MiB
8             nan        23.115 s           441 MiB
When I try minibatch sizes larger than 512, I get Out Of Memory errors.
I guess it makes sense for the smaller batches to take longer to execute, as there are more updates in sequence on the same data. However, I am not sure why the computation time increases once the minibatch is larger than 128 samples, instead of decreasing further.
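For reference, the number of weight updates per epoch is ceil(5000 / batch_size), which is what drives the long runtimes at the small-batch end:

```python
import math

# Updates per epoch for each batch size (5000 samples, consecutive chunks).
for bs in [512, 256, 128, 64, 32, 16, 8]:
    print(bs, math.ceil(5000 / bs))
# 512 -> 10, 256 -> 20, 128 -> 40, 64 -> 79, 32 -> 157, 16 -> 313, 8 -> 625
```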
My assumption is that it has to do with the GPU getting full and being unable to parallelise any further, but I couldn't find anything about this online.
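One check I'm considering to test this (a sketch, not something I have run yet): time a single training step at each batch size and see whether the per-step time stays roughly constant up to about 128 and then starts growing, which would point at the GPU's parallel capacity being saturated.

```python
import time
import numpy as np
import tensorflow as tf

# Same placeholder autoencoder as in the sketch above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(512, 784).astype("float32")  # dummy data is enough for timing

for bs in [8, 16, 32, 64, 128, 256, 512]:
    batch = x[:bs]
    model.train_on_batch(batch, batch)  # warm-up (graph build / memory allocation)
    start = time.time()
    for _ in range(50):
        model.train_on_batch(batch, batch)
    print(f"batch {bs}: {(time.time() - start) / 50 * 1000:.2f} ms per step")
```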