
I am reading through the SageMaker documentation on distributed training and am confused about the terminology:

Mini-Batch, Micro-batch and Per-replica batch size

I understand that in data parallelism there are multiple copies of the model, and each copy receives a slice of data whose size is the "Per Replica Batch Size".

  1. Could someone ELI5 how a micro-batch fits into this context?
  2. Is this terminology commonly used in the field, or is it specific to AWS SageMaker?
outlier229

1 Answer


Micro-batches come into the picture when you are using model parallelism for training. In that case the model is sharded into multiple segments, and each segment is loaded onto a different GPU. To improve GPU utilization, model-parallel training further divides each mini-batch into micro-batches, so that the GPUs holding different model segments can work on different micro-batches at the same time instead of sitting idle. If you are using a purely data-parallel approach, you only deal with the global batch size and the per-replica batch size.
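To make the relationship between the three terms concrete, here is a minimal sketch in plain NumPy. It is not the SageMaker API; the variable names (num_replicas, per_replica_batch, num_micro_batches, etc.) are made up for illustration, and the numbers are arbitrary.

    import numpy as np

    # Data parallelism: several identical copies (replicas) of the model,
    # each processing its own per-replica batch every step.
    num_replicas = 8
    per_replica_batch = 32
    global_batch = num_replicas * per_replica_batch  # 256 samples per optimizer step

    # Model/pipeline parallelism: each replica further slices its per-replica
    # batch into micro-batches so the model shards on different GPUs can
    # work on different micro-batches concurrently.
    num_micro_batches = 4
    micro_batch = per_replica_batch // num_micro_batches  # 8 samples each

    replica_batch = np.random.rand(per_replica_batch, 224, 224, 3)  # dummy image data
    micro_batches = np.split(replica_batch, num_micro_batches)

    for mb in micro_batches:
        # Each micro-batch flows through the pipeline of model shards;
        # gradients are accumulated and applied once per mini-batch.
        print(mb.shape)  # (8, 224, 224, 3)

So the "mini-batch" (global batch) is what one optimizer step sees across the whole cluster, the per-replica batch is that mini-batch divided across data-parallel replicas, and the micro-batch is a further split used only when the model itself is sharded across devices.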

Arun Lokanatha