We've just gotten a multi-GPU machine at work, and I'm trying to verify that two GPUs are faster than one with Caffe. To do this, I'm using the quick-train example for the CIFAR-10 dataset. So far, I'm finding that using two GPUs slows training down, and I don't understand why.
The Caffe version I'm running is:
me@ubuntu:~/Downloads/caffe$ ./build/tools/caffe -version
caffe version 1.0.0-rc3
The topology of our GPUs is as follows:
me@ubuntu:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PIX     PHB     PHB     0-11
GPU1    PIX      X      PHB     PHB     0-11
GPU2    PHB     PHB      X      PIX     0-11
GPU3    PHB     PHB     PIX      X      0-11
Legend:
X = Self
SOC = PCI path traverses a socket-level link (e.g. QPI)
PHB = PCI path traverses a host bridge
PXB = PCI path traverses multiple internal switches
PIX = PCI path traverses an internal switch
NV# = Path traverses # NVLinks
The processes running on each GPU are as follows:
me@ubuntu:~$ nvidia-smi pmon
# gpu     pid  type    sm   mem   enc   dec   command
# Idx       #   C/G     %     %     %     %   name
    0    1679     G     0     0     0     0   X
    0    2740     G     0     1     0     0   compiz
    0    3600     G     0     0     0     0   firefox
    1       -     -     -     -     -     -   -
    2       -     -     -     -     -     -   -
    3    3328     C     0     0     0     0   python
I trained on the CIFAR-10 dataset using this basic script:
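#!/usr/bin/env sh
# Train the CIFAR-10 quick model; extra arguments are passed through to caffe.
set -e
TOOLS=./build/tools
# Append both stdout and stderr to the log file.
$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_quick_solver.prototxt \
    --gpu=1,2 "$@" >> ~/Desktop/caffe_2GPUa_out.txt 2>&1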
with the slight variations of:
--gpu=2,3
and
--gpu=2
I would've expected the fastest result to come from --gpu=2,3, followed by --gpu=1,2, then --gpu=2. Instead, I saw the exact opposite.
Here is what I saw.
For --gpu=2:
I0227 14:41:26.948098 7712 caffe.cpp:251] Starting Optimization
I0227 14:42:04.841394 7712 caffe.cpp:254] Optimization Done.
For --gpu=1,2:
I0227 15:22:56.675775 7946 parallel.cpp:425] Starting Optimization
I0227 15:23:39.097970 7946 caffe.cpp:254] Optimization Done.
For --gpu=2,3:
I0227 14:43:13.466243 7742 parallel.cpp:425] Starting Optimization
I0227 14:43:56.215469 7742 caffe.cpp:254] Optimization Done.
So my resulting training times (from "Starting Optimization" to "Optimization Done") are:
gpu=2      37.89 sec
gpu=1,2    42.42 sec
gpu=2,3    42.75 sec
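For anyone who wants to re-check the arithmetic, the elapsed times can be derived directly from the log timestamps above. This is only a minimal sketch: `elapsed` is a helper name I'm introducing here for illustration, and it assumes both timestamps fall on the same day.

```shell
#!/usr/bin/env sh
# elapsed START END
# Print the number of seconds between two glog-style HH:MM:SS.ffffff
# timestamps. Assumption: both timestamps are from the same day.
elapsed() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        split(s, a, ":")   # a[1]=hours, a[2]=minutes, a[3]=seconds
        split(e, b, ":")
        start = a[1] * 3600 + a[2] * 60 + a[3]
        stop  = b[1] * 3600 + b[2] * 60 + b[3]
        printf "%.2f\n", stop - start
    }'
}

# "Starting Optimization" and "Optimization Done" timestamps for each run.
elapsed 14:41:26.948098 14:42:04.841394   # --gpu=2
elapsed 15:22:56.675775 15:23:39.097970   # --gpu=1,2
elapsed 14:43:13.466243 14:43:56.215469   # --gpu=2,3
```

This only measures the wall-clock time of the optimization phase as reported in the log; it says nothing about how much data each run processed.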
Clearly, there's something I don't understand about running multiple GPUs with Caffe. I had expected two GPUs to give me a speed-up of about 1.8x. What am I missing?