0

After playing with the current distributed training implementation for a while, I think it views each GPU as a separate worker.However, It is common now to have 2~4 GPUs in one box. Isn't it better to adopt the single box multi-GPU methodology to compute average gradients in single box first and then sync up across multiple nodes? This way it ease the I/O traffic a lot, which is always the bottleneck in data parallelism.

I was told it's possible with the current implementation by having all GPUs in single box as a worker, but I am not able to figure out how to tie the average gradients with SyncReplicasOptimizer, since SyncReplicasOptimizer directly takes the optimizer as input.

Any ideas from anyone?

1 Answers1

2

Distributed TensorFlow supports multiple GPUs in the same worker task. One common way to perform distributed training for image models is to perform synchronous training across multiple GPUs in the same worker, and asynchronous training across workers (though other configurations are possible). This way you only pull the model parameters to the worker once, and they are distributed among the local GPUs, easing the network bandwidth utilization.

To do this kind of training, many users perform "in-graph replication" across the GPUs in a single worker. This can use an explicit loop across the local GPU devices, like in the CIFAR-10 example model; or higher-level library support, like in the model_deploy() utility from TF-Slim.

mrry
  • 125,488
  • 26
  • 399
  • 400