I'm fine-tuning ResNet50 on the CIFAR10 dataset using tf.slim's train_image_classifier.py script:
python train_image_classifier.py \
    --train_dir=${TRAIN_DIR}/all \
    --dataset_name=cifar10 \
    --dataset_split_name=train \
    --dataset_dir=${DATASET_DIR} \
    --checkpoint_path=${TRAIN_DIR} \
    --model_name=resnet_v1_50 \
    --max_number_of_steps=3000 \
    --batch_size=32 \
    --num_clones=4 \
    --learning_rate=0.0001 \
    --save_interval_secs=10 \
    --save_summaries_secs=10 \
    --log_every_n_steps=10 \
    --optimizer=sgd
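For reference, the GPUs visible to TensorFlow can be listed with a standard TF 1.x one-liner (nothing specific to the slim script):

# list the GPU devices TensorFlow can see
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])"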
For 3,000 steps, running this on a single GPU (Tesla M40) takes around 30 min, while running on 4 GPUs takes 50+ min. (The final accuracy is similar in both cases: ~75% and ~78%.)
I know that one common cause of slowdowns in multi-GPU setups is the input pipeline (loading and preprocessing the images), but tf.slim already runs that part on the CPU. Any ideas of what the issue could be?
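In case it's relevant, I can watch GPU utilization while training runs with plain nvidia-smi polling (nothing tf.slim-specific):

# poll utilization and memory for each GPU once per second
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1

If those numbers come back low on the 4-GPU run, I assume the input side is the place to look; if I'm reading train_image_classifier.py correctly, it also exposes --num_readers and --num_preprocessing_threads flags (both defaulting to 4), so I could try appending something like:

    --num_readers=8 \
    --num_preprocessing_threads=8

Thank you!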