We need to train large networks using TensorFlow that take several days to complete on a GPU. Amazon offers GPU instances such as p2.16xlarge with 16 GPUs (NVIDIA K80). Now I'm wondering whether TensorFlow utilizes multiple GPUs efficiently, or whether I'd be just as fast using a desktop with a single Titan X.

Those GPU instances are quite expensive, so I'm looking to build a machine myself (Linux-based). SLI doesn't seem to work with CUDA, so am I stuck with one GPU at a time?

1 Answer

TensorFlow does utilize multiple GPUs efficiently, provided you use a script written for multi-GPU training, e.g. cifar10_multi_gpu_train.py:

python cifar10_multi_gpu_train.py --num_gpus=X

Replace X with the number of GPUs. The workload is split across the GPUs, and the example takes into account that transferring data between GPUs is relatively slow: the shared model variables are kept on the CPU, so each GPU only computes gradients for its share of the batch.
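For a rough idea of what such a script does under the hood, here is a minimal sketch of that data-parallel "tower" pattern, written against the TF 1.x graph API. The one-layer model, the NUM_GPUS value and the assign_to_device helper are placeholders of my own for illustration, not code taken from the tutorial:

import tensorflow as tf  # TF 1.x graph API

NUM_GPUS = 2  # hypothetical; set to the number of GPUs you have

def assign_to_device(gpu):
    # Pin variables to the CPU, all other ops to the given GPU.
    def _assign(op):
        if op.type in ('Variable', 'VariableV2', 'VarHandleOp'):
            return '/cpu:0'
        return '/gpu:%d' % gpu
    return _assign

def tower_loss(images, labels):
    # Stand-in model (a single dense layer); the real example
    # builds the full CIFAR-10 CNN here.
    logits = tf.layers.dense(tf.layers.flatten(images), 10)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))

images = tf.placeholder(tf.float32, [None, 24, 24, 3])
labels = tf.placeholder(tf.int64, [None])
opt = tf.train.GradientDescentOptimizer(0.1)

# Each GPU ("tower") gets an equal shard of the input batch.
image_shards = tf.split(images, NUM_GPUS)
label_shards = tf.split(labels, NUM_GPUS)

tower_grads = []
for i in range(NUM_GPUS):
    with tf.device(assign_to_device(i)), \
         tf.variable_scope('model', reuse=(i > 0)):
        loss = tower_loss(image_shards[i], label_shards[i])
        tower_grads.append(opt.compute_gradients(loss))

# Average the per-tower gradients and apply them in one step.
avg_grads = [(tf.reduce_mean(tf.stack([g for g, _ in gv]), 0), gv[0][1])
             for gv in zip(*tower_grads)]
train_op = opt.apply_gradients(avg_grads)

Each training step then feeds one large batch: the per-GPU losses and gradients are computed in parallel, and only the averaged gradients touch the shared variables on the CPU.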

Using 16 x NVIDIA K80 GPUs should be a lot quicker than a single Titan X, but how much quicker is hard to say. If you are happy for training to take longer, then obviously don't spend the money; it is up to you whether the time saving justifies the cost.

Further details: https://www.tensorflow.org/tutorials/deep_cnn/#training_a_model_using_multiple_gpu_cards

bao7uo
  • As I understand it, TensorFlow doesn't actually benefit from the K80's double precision and memory error correction, so I was told a Titan X would probably be better in that case. So how would 16 x Tesla K80 compare to 10 x Titan X (Pascal)? (Thinkmate offers preconfigured servers with 10 Titan X cards) –  Jan 13 '17 at 12:47
  • The 10 x Titans could well be quicker, but without testing it is hard to say. – bao7uo Jan 13 '17 at 13:00