After playing with the current distributed training implementation for a while, I think it treats each GPU as a separate worker. However, it is now common to have 2~4 GPUs in one box. Wouldn't it be better to adopt the single-box multi-GPU approach of averaging gradients within one box first and then syncing across nodes? That would ease the I/O traffic a lot, which is usually the bottleneck in data parallelism.
I was told this is possible with the current implementation by treating all the GPUs in a single box as one worker, but I can't figure out how to tie the local gradient averaging into SyncReplicasOptimizer, since SyncReplicasOptimizer takes the optimizer directly as input.
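Roughly what I have in mind is sketched below (TF 1.x style; `tower_loss()`, `num_workers`, and `global_step` are placeholders of my own, not anything from the existing code): average the per-GPU gradients locally, then hand the averaged list to the SyncReplicasOptimizer wrapper's `apply_gradients()` so cross-worker aggregation still happens there.

```python
import tensorflow as tf

NUM_GPUS = 4  # GPUs in one box (assumed)

def average_gradients(tower_grads):
    """Average the (gradient, variable) pairs computed on each local GPU."""
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        var = grads_and_vars[0][1]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), var))
    return averaged

# Base optimizer, wrapped so apply_gradients() aggregates across workers.
base_opt = tf.train.GradientDescentOptimizer(0.01)
opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=num_workers,   # one "replica" per box, not per GPU (assumed)
    total_num_replicas=num_workers)

tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i), \
         tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
        loss = tower_loss(i)             # hypothetical per-GPU loss builder
        tower_grads.append(opt.compute_gradients(loss))

# Average locally first; the wrapper then syncs the averaged gradients
# across workers when apply_gradients() runs.
train_op = opt.apply_gradients(average_gradients(tower_grads),
                               global_step=global_step)
```

Is this the intended way to combine the two, or does SyncReplicasOptimizer expect one gradient list per GPU?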
Any ideas from anyone?