We've just gotten a multi-GPU machine at work, and I'm trying to verify that 2 GPUs on Caffe are better than 1. To do this, I'm using the quick-train example on the CIFAR-10 dataset. So far, I'm finding that 2 GPUs slow things down, and I don't understand why.

The Caffe version I'm running is:

me@ubuntu:~/Downloads/caffe$ ./build/tools/caffe -version
caffe version 1.0.0-rc3

The topology of our GPUs is as follows:

me@ubuntu:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PIX     PHB     PHB     0-11
GPU1    PIX      X      PHB     PHB     0-11
GPU2    PHB     PHB      X      PIX     0-11
GPU3    PHB     PHB     PIX      X      0-11

Legend:
  X   = Self
  SOC = PCI path traverses a socket-level link (e.g. QPI)
  PHB = PCI path traverses a host bridge
  PXB = PCI path traverses multiple internal switches
  PIX = PCI path traverses an internal switch
  NV# = Path traverses # NVLinks

The processes each GPU is handling are as follows:

me@ubuntu:~$ nvidia-smi pmon
      # gpu     pid  type    sm   mem   enc   dec   command
      # Idx       #   C/G     %     %     %     %   name
         0      1679   G      0     0     0     0   X              
         0      2740   G      0     1     0     0   compiz         
         0      3600   G      0     0     0     0   firefox        
         1       -     -      -     -     -     -   -              
         2       -     -      -     -     -     -   -              
         3      3328   C     0     0     0     0   python   

I trained on the CIFAR-10 dataset using this basic script:

#!/usr/bin/env sh
set -e

TOOLS=./build/tools

$TOOLS/caffe train \
  --solver=examples/cifar10/cifar10_quick_solver.prototxt \
  --gpu=1,2 $@ >> ~/Desktop/caffe_2GPUa_out.txt 2>&1

with the slight variations of:

 --gpu=2,3

and

 --gpu=2
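
For reference, the three runs can also be driven from one wrapper script; this is just a sketch, and the per-configuration log names here are made up rather than the exact files I used:

#!/usr/bin/env sh
set -e

TOOLS=./build/tools

# Sketch: run the same quick solver once per GPU configuration,
# sending each run's output to its own log for later comparison.
for GPUS in 2 1,2 2,3; do
  $TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_quick_solver.prototxt \
    --gpu=$GPUS >> ~/Desktop/caffe_gpu_${GPUS}_out.txt 2>&1
done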

I would have expected the fastest result from --gpu=2,3 (the pair connected through an internal PCIe switch, PIX), followed by --gpu=1,2 (the pair connected through the host bridge, PHB), and the slowest from --gpu=2 alone. Instead, I saw the exact opposite.

What I saw was this:

For --gpu=2:

I0227 14:41:26.948098  7712 caffe.cpp:251] Starting Optimization
I0227 14:42:04.841394  7712 caffe.cpp:254] Optimization Done.

For --gpu=1,2:

I0227 15:22:56.675775  7946 parallel.cpp:425] Starting Optimization
I0227 15:23:39.097970  7946 caffe.cpp:254] Optimization Done.

For --gpu=2,3:

I0227 14:43:13.466243  7742 parallel.cpp:425] Starting Optimization
I0227 14:43:56.215469  7742 caffe.cpp:254] Optimization Done.

So, my resulting times to train are:

 gpu=2     37.89 sec
 gpu=1,2   42.42 sec
 gpu=2,3   42.74 sec
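
The elapsed times are just the differences between the "Starting Optimization" and "Optimization Done" timestamps. A quick sketch of how they can be pulled from one of the logs above (assuming a run starts and ends on the same calendar day):

#!/usr/bin/env sh
# Sketch: extract elapsed training time from a Caffe log.
# LOG is whichever output file a run wrote to.
LOG=~/Desktop/caffe_2GPUa_out.txt

grep -E 'Starting Optimization|Optimization Done' "$LOG" \
  | awk -F'[ :]+' '{ t = $2*3600 + $3*60 + $4 }
                   /Starting/ { start = t }
                   /Done/     { print t - start, "sec" }'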

Clearly, there's something I don't understand about running multiple GPUs with Caffe. I had expected that using 2 GPUs would give me a speed-up of roughly 1.8x. What am I not understanding here?

  • There are a lot of extra tasks involved in coordinating multiple nodes, and you may have run afoul of having a job too simple to make "proper" use of the dual-node situation. Can you try this again on, say, AlexNet or GoogLeNet for at least 500 iterations? Also, try looking at the total time *after* the training begins. – Prune Feb 27 '17 at 22:40
  • @Prune - Thanks. It'll be a few days before I can try again, but I'll give larger problems a try. I *was* looking at total time after training begins, though: Optimization start to Optimization end is essentially pure training. – user1245262 Feb 28 '17 at 02:47
  • Darn. Okay, you've probably covered it already. I work in the CPU world, and I haven't had occasion to try multi-node CIFAR. I do know that I get good scaling going from 1 to 2 nodes. – Prune Feb 28 '17 at 17:01
  • Also, do you have a way to specify multi-node protocol, but run on only one node? I use that as another data point for my work. – Prune Feb 28 '17 at 17:01
  • @Prune - How do you do that? (specify multi-node, but only run single node) – user1245262 Feb 28 '17 at 17:40
  • In my context, I run **mpirun**, but give it only one host IP address. MPI is the message-passing interface installed here. I don't know what is the cognate for GPU use, or if the interface matches at all. – Prune Feb 28 '17 at 17:51
