I am using MXNet to fine-tune a ResNet model on the Caltech 256 dataset, following this example:
https://mxnet.incubator.apache.org/how_to/finetune.html
I am doing this primarily as a POC to test distributed training (which I'll later use in my actual project).
First I ran this example on a single machine with 2 GPUs for 8 epochs. It took around 20 minutes and the final validation accuracy was 0.809072.
Then I ran it on 2 machines (identical, each with 2 GPUs) in a distributed setting, partitioning the training data in half between the two machines (using num_parts and part_index).
8 epochs took only 10 minutes, but the final validation accuracy was only 0.772847 (the higher of the two workers). Even when I used 16 epochs, I was only able to achieve 0.797006.
So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs (or more) to achieve the same accuracy, then what's the advantage? Maybe I am missing something.
I can post my code and run command if required.
Thanks
EDIT
Some more info to help with the answer:
MXNet version: 0.11.0
Topology: 2 workers (each on a separate machine)
Code: https://gist.github.com/reactivefuture/2a1f9dcd3b27c0fe8215b4e3d25056ce
Command to start:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
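Here hosts is just a plain text file listing the two worker machines, one address per line (the addresses below are placeholders, not my actual ones):

192.168.1.101
192.168.1.102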
I have used a hacky way to do the partitioning (using IP addresses) since I couldn't get kv.num_workers and kv.rank to work.
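For reference, what I was trying to do looks roughly like this (a minimal sketch, assuming the launcher sets the workers up correctly; the path and batch size are placeholders):

import mxnet as mx

# Intended approach: derive each worker's shard from the kvstore
# instead of hard-coding it based on the machine's IP address.
kv = mx.kvstore.create('dist_sync')

train_iter = mx.io.ImageRecordIter(
    path_imgrec='caltech-256-60-train.rec',  # placeholder path
    data_shape=(3, 224, 224),
    batch_size=16,                           # placeholder batch size
    shuffle=True,
    num_parts=kv.num_workers,                # total workers in the job
    part_index=kv.rank                       # this worker's index
)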