
I'm fine-tuning ResNet50 on the CIFAR10 dataset using tf.slim's train_image_classifier.py script:

python train_image_classifier.py \
  --train_dir=${TRAIN_DIR}/all \
  --dataset_name=cifar10 \
  --dataset_split_name=train \
  --dataset_dir=${DATASET_DIR} \
  --checkpoint_path=${TRAIN_DIR} \
  --model_name=resnet_v1_50 \
  --max_number_of_steps=3000 \
  --batch_size=32 \
  --num_clones=4 \
  --learning_rate=0.0001 \
  --save_interval_secs=10 \
  --save_summaries_secs=10 \
  --log_every_n_steps=10 \
  --optimizer=sgd

For 3k steps, running this on a single GPU (Tesla M40) takes around 30 minutes, while running on 4 GPUs takes 50+ minutes. (The accuracy is similar in both cases: ~75% and ~78%.)

I know that one possible cause of slowdown in multi-GPU setups is loading the images, but tf.slim uses the CPU for that. Any idea what the issue could be? Thank you!

Anas
  • Timeline would help identify the performance bottleneck. Usage of timeline: http://stackoverflow.com/questions/36123740/is-there-a-way-of-determining-how-much-gpu-memory-is-in-use-by-tensorflow/37931964#37931964 – Yao Zhang Apr 05 '17 at 06:42
  • @YaoZhang I've kept track of GPU usage through nvidia-smi, and there are bursts where all 4 GPUs are at 90+% utilization followed by stretches at 0%, and it cycles like this throughout training. – Anas Apr 05 '17 at 13:46
  • This is better answered if you file an issue on [Github](https://github.com/tensorflow/tensorflow/issues) – keveman Apr 11 '17 at 16:48
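
Following up on the timeline suggestion in the comments above, here is a minimal TF 1.x profiling sketch. The matmul op is just a stand-in for the real training op produced by train_image_classifier.py, not the script's code. The resulting timeline.json can be opened in chrome://tracing to see whether time goes to GPU kernels or to the CPU-side input pipeline.

import tensorflow as tf
from tensorflow.python.client import timeline

# Toy op standing in for the actual train_op (assumption, for illustration only).
x = tf.random_normal([1024, 1024])
train_op = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace of the step that was just run.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())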

1 Answer

  1. You will not necessarily train faster by setting num_clones to use multiple GPUs, because slim then processes batch_size examples on each clone, i.e. batch_size * num_clones examples per step, split across your GPUs. Each clone's loss is divided by num_clones and the scaled losses are summed into the total loss (see the first sketch below). (https://github.com/tensorflow/models/blob/master/research/slim/deployment/model_deploy.py)
  2. When the CPU becomes the bottleneck, the input pipeline cannot produce data fast enough to feed the clones, so with num_clones=4 a step can end up roughly 4 times slower (see the second sketch below). (https://www.tensorflow.org/performance/performance_guide)
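
A simplified sketch of point 1 (an illustration of the behavior, not model_deploy's actual code): the per-step batch grows with num_clones, each clone's loss is scaled by 1/num_clones, and the scaled losses are summed. The dense layer is just a placeholder standing in for resnet_v1_50.

import tensorflow as tf

num_clones = 4
batch_size = 32  # per clone, so one step processes num_clones * batch_size images

def clone_loss_fn(images, labels):
    # Placeholder model standing in for resnet_v1_50 (assumption).
    logits = tf.layers.dense(tf.layers.flatten(images), 10)
    return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

images = tf.random_normal([num_clones * batch_size, 32, 32, 3])
labels = tf.random_uniform([num_clones * batch_size], maxval=10, dtype=tf.int32)

scaled_losses = []
for i in range(num_clones):
    with tf.device('/gpu:%d' % i):
        start = i * batch_size
        loss = clone_loss_fn(images[start:start + batch_size],
                             labels[start:start + batch_size])
        # Each clone contributes loss / num_clones to the total loss.
        scaled_losses.append(loss / num_clones)

total_loss = tf.add_n(scaled_losses)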
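
For point 2, the usual remedy is to give the CPU-side pipeline more parallelism and to prefetch batches so data is ready when the GPUs ask for it. Below is a minimal tf.data sketch; the record-parsing function and file pattern are hypothetical placeholders, not the slim dataset code. With the slim script itself, the analogous knobs are the --num_readers and --num_preprocessing_threads flags, if your version of the script has them.

import tensorflow as tf

def parse_record(serialized):
    # Hypothetical TFRecord layout; adapt the keys to the actual dataset.
    features = tf.parse_single_example(serialized, {
        'image/encoded': tf.FixedLenFeature([], tf.string),
        'image/class/label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_png(features['image/encoded'], channels=3)
    image = tf.image.resize_images(image, [224, 224])  # resnet_v1_50 input size
    return image, features['image/class/label']

dataset = (tf.data.TFRecordDataset(tf.gfile.Glob('cifar10_train*.tfrecord'))
           .map(parse_record, num_parallel_calls=8)  # parallel CPU preprocessing
           .batch(32)
           .prefetch(4))  # keep batches ready ahead of the GPUs

images, labels = dataset.make_one_shot_iterator().get_next()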
bottlerun
  • What can be done in this case then to speed up training? Thanks. – Anas Dec 15 '17 at 15:30
  • @Anas Find the bottleneck first. Have a look at the second link I posted. I'm learning to use the timeline to profile now; you can try that too. – bottlerun Dec 20 '17 at 10:02