I'm having trouble understanding and replicating the original implementation of ResNet on the CIFAR-10 dataset, as described in the paper "Deep Residual Learning for Image Recognition". Specifically, I have a few questions about the following passage:
We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?
How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?
How can I implement the training and learning rate scheduling in PyTorch? Since 45000 isn't divisible by 128, should I drop the last 72 images every epoch? Also, since the 32k, 48k, and 64k milestones don't fall on a whole number of epochs, should I round them to the nearest epochs? Or is there a way to change the learning rate and terminate training in the middle of an epoch?
If anyone could point me in the right direction, I greatly appreciate it. I'm new to deep learning, so thank you for your help and kind understanding.