
I understand that TensorFlow supports distributed training.

I found num_clones in train_image_classifier.py, which lets me use multiple GPUs locally:

python $TF_MODEL_HOME/slim/train_image_classifier.py \
--num_clones=2 \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=vgg_19 \
--batch_size=32 \
--max_number_of_steps=100

How do I use multiple GPUs on different hosts?

1 Answer


You need to use `--worker_replicas=<number of hosts>` to train on multiple hosts with the same number of GPUs. Apart from that, you have to configure `--task`, `--num_ps_tasks`, `--sync_replicas`, and `--replicas_to_aggregate` when training on multiple hosts.
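
Here is only a rough sketch of how those flags could be combined for a two-host case (say 192.168.0.1 and 192.168.0.2, one GPU each). The ports and the grpc:// targets passed to `--master` are my own illustrative assumptions; as far as I can tell the script only connects to the master you give it, so the TensorFlow server/parameter-server processes for those addresses have to be running already, and `--train_dir` should be on storage every task can reach.

# Worker 0 (chief), run on 192.168.0.1 -- addresses and ports are illustrative
python $TF_MODEL_HOME/slim/train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=vgg_19 \
--batch_size=32 \
--max_number_of_steps=100 \
--num_clones=1 \
--worker_replicas=2 \
--num_ps_tasks=1 \
--sync_replicas=true \
--replicas_to_aggregate=2 \
--task=0 \
--master=grpc://192.168.0.1:2222

# Worker 1, run on 192.168.0.2 -- identical except for --task and --master
python $TF_MODEL_HOME/slim/train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=vgg_19 \
--batch_size=32 \
--max_number_of_steps=100 \
--num_clones=1 \
--worker_replicas=2 \
--num_ps_tasks=1 \
--sync_replicas=true \
--replicas_to_aggregate=2 \
--task=1 \
--master=grpc://192.168.0.2:2222

Here `--worker_replicas` is the number of worker hosts, `--task` is this worker's index, `--num_ps_tasks` is the number of parameter-server tasks, and with `--sync_replicas=true` the value of `--replicas_to_aggregate` is how many workers' gradients are combined before each update.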

I'd also suggest you give Horovod a try; I'm planning to try it myself in a couple of days.
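
If you do, the launch side is simpler. A minimal sketch, assuming the training script has already been ported to Horovod (train_image_classifier.py has not; `my_hvd_train.py` below is a placeholder name) and that Horovod was installed with its TensorFlow bindings:

# One process per GPU: -np is the total process count,
# -H lists host:slots pairs (here one slot/GPU per host).
horovodrun -np 2 -H 192.168.0.1:1,192.168.0.2:1 \
python my_hvd_train.py

# Inside the (hypothetical) script you would call hvd.init(), pin each
# process to one GPU, wrap your optimizer in hvd.DistributedOptimizer,
# and broadcast the initial variables from rank 0.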

SnShines
  • If I have 192.168.0.1 and 192.168.0.2, with one GPU on each host, then I run the above cmd on 192.168.0.1 and add `--worker_replicas="192.168.0.2" --task=1 --num_ps_tasks=1 --sync_replicas=true`, right? – daixiang0 Nov 29 '17 at 06:04
  • @SnShines Assuming a scenario of 2 machines (server1 and server2) with 3 gpus on each, care to provide a concrete example of how many processes should be spawned and example values for the flags that you described? – ZeDuS Dec 21 '17 at 21:47
  • Hi, any chance you could explain the difference between `worker_replicas`, `ps_tasks`, `num_ps_tasks`, `task`, `num_replicas`, `num_clones`? – Austin Jul 16 '18 at 16:41
  • @Austin, did you figure out the difference between worker_replicas, ps_tasks, num_ps_tasks, task, num_replicas, num_clones? – Sanjay Mar 30 '21 at 13:09