
I have two GPUs (a Titan X (Pascal) and a GTX 1080). I am trying to run a single-threaded graph computation. The graph consists of two separate matrix-multiplication chains, each assigned to the corresponding GPU.

Here is the code that I'm using:

import tensorflow as tf
import numpy as np
import random
import time
import logging

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline


def test():
    n = 5000

    with tf.Graph().as_default():
        A1 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        A2 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        with tf.device('/gpu:0'):
            B1 = A1
            for l in xrange(10):
                B1 = tf.matmul(B1, A1)

        with tf.device('/gpu:1'):
            B2 = A2
            for l in xrange(10):
                B2 = tf.matmul(B2, A2)
            C = tf.matmul(B1, B2)

        run_metadata = tf.RunMetadata()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
            start = time.time()
            logging.info('started')
            A1_ = np.random.rand(n, n)
            A2_ = np.random.rand(n, n)
            sess.run([C],
                     feed_dict={A1: A1_, A2: A2_},
                     options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
                     run_metadata=run_metadata)
            logging.info('writing trace')
            trace = timeline.Timeline(step_stats=run_metadata.step_stats)
            with open('timeline.ctf.json', 'w') as trace_file:
                trace_file.write(trace.generate_chrome_trace_format())
            logging.info('trace written')
            end = time.time()
            logging.info('computed')
            logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()

  • It takes 20.4 secs to finish.
  • It takes 14 secs to finish if I set all ops to gpu0 (the Titan X).
  • It takes 19.8 secs to finish if I set all ops to gpu1 (the GTX 1080).

I can see that TensorFlow found both GPUs and all devices have been placed correctly. Why is there no speedup from using two GPUs instead of one? Could the problem be that the GPUs are different models (AFAIK CUDA allows that)?

Thanks.

EDIT: I updated the code to use different initial matrices for the two chains, since otherwise TensorFlow seems to do some optimization.

Here is a timeline profile json-file link: https://api.myjson.com/bins/23csi

[Screenshot of the timeline]
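To dig into the linked trace without the Chrome viewer, here is a minimal sketch (not part of the original post) that lists the longest ops per pid. It assumes the standard Chrome trace event fields ("ph", "name", "dur", "pid") that timeline.generate_chrome_trace_format writes:

import json

# Load the trace written by the script above (Chrome trace format).
with open('timeline.ctf.json') as f:
    trace = json.load(f)

# Complete events ('ph' == 'X') carry a duration in microseconds.
ops = [e for e in trace['traceEvents'] if e.get('ph') == 'X']

# Print the ten longest ops with their pid, so the long MatMuls are easy to spot.
for e in sorted(ops, key=lambda e: e.get('dur', 0), reverse=True)[:10]:
    print('pid=%-3s dur=%8d us  %s' % (e.get('pid'), e.get('dur', 0), e['name']))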

This timeline raises more questions than answers:

  1. Why does pid 7 (gpu0) have two lines of execution?
  2. What are the long MatMuls in pids 3 and 5? (input0 "_recv_A_0/_3", input1 "_recv_A_0/_3", name "MatMul", op "MatMul")
  3. It seems that every op is executed on gpu0 except pid 5.
  4. There are a lot of small MatMul ops (too small to see in the screenshot) right after the long MatMul ops from pid 3 and pid 5. What are these?
MikleB
  • You can look at the timeline to see what the bottleneck is https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659 – Yaroslav Bulatov Nov 03 '16 at 20:12
  • Also, you can do `sess.run(C.op)` instead of `sess.run(C)`; that removes the TensorFlow->Python transfer from your timing (a sketch applying this, together with a warm-up run, follows these comments) – Yaroslav Bulatov Nov 04 '16 at 01:22
  • I got an error "TypeError: __init__() got an unexpected keyword argument 'run_metadata'" in the Session constructor. I installed TensorFlow from source in Sept '16, and just tried reinstalling it from pip (still the same error). – MikleB Nov 04 '16 at 10:44
  • @YaroslavBulatov thanks for the timeline profiler advice. I have updated the post with more questions. – MikleB Nov 04 '16 at 16:21
  • GPUs have multiple streams, so some work is duplicated in the display -- it can show the same computation in the "compute" channel for the GPU as well as in the dedicated "stream" channel. The long ops are probably due to initial kernel launch overhead; I posted a benchmark with pre-warming – Yaroslav Bulatov Nov 04 '16 at 19:50
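A minimal sketch (not part of the original thread) of applying the two comment suggestions to the question's code, assuming it is placed inside the `with tf.Session(...)` block after A1_ and A2_ are created: run the graph once as a warm-up so the initial kernel launches are excluded, and fetch `C.op` so the 5000x5000 result is not copied back to Python.

# Warm-up run: triggers the one-time kernel launches / PTXAS compilation on both GPUs.
sess.run(C.op, feed_dict={A1: A1_, A2: A2_})

# Timed run: fetching C.op keeps the result on the GPU, so only compute time is measured.
start = time.time()
sess.run(C.op, feed_dict={A1: A1_, A2: A2_})
print('elapsed: %.3f s' % (time.time() - start))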

2 Answers


There is a significant delay when a kernel is launched for the first time on a GPU, probably caused by PTXAS compilation. This delay can be on the order of seconds, and it accumulates when you use more than one GPU, so in your case the run is slower because the time is dominated by an extra "initial kernel launch". One way to benchmark pure computation time is to do "pre-warming": execute each CUDA operation at least once on each GPU before timing. I observed the same slowness when running your benchmark on two Titan X cards, but the delay disappeared once I "pre-warmed" the kernels.

Here's before pre-warming: [timeline screenshot]

Here's after pre-warming: [timeline screenshot]

Below is your code modified to do pre-warming, and also to remove any TensorFlow<->Python transfers.

import tensorflow as tf

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline
import logging, time
import numpy as np

def test():
    n = 5000

    with tf.device('/gpu:0'):
        A1 = tf.Variable(tf.ones_initializer(shape=[n, n]), name='A1')
        B1 = A1
        for l in xrange(10):
            B1 = tf.matmul(A1, B1, name="chain1")

    with tf.device('/gpu:1'):
        A2 = tf.Variable(tf.ones_initializer(shape=[n, n]), name='A2')
        B2 = A2
        for l in xrange(10):
            B2 = tf.matmul(A2, B2, name="chain2")
        C = tf.matmul(B1, B2)

    run_metadata = tf.RunMetadata()
    start = time.time()
    logging.info('started')
    sess = tf.InteractiveSession(config=tf.ConfigProto(allow_soft_placement=False, log_device_placement=True))
    sess.run(tf.initialize_all_variables())
    # do warm-run
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    run_metadata = tf.RunMetadata()
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    logging.info('writing trace')
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.ctf.json', 'w') as trace_file:
        trace_file.write(trace.generate_chrome_trace_format(show_memory=True))
    logging.info('trace written')
    end = time.time()
    logging.info('computed')
    logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()
Yaroslav Bulatov
  • btw, the labels are misleading: "gpu:0/stream:22" in the timeline is actually on gpu:1, as can be seen from log_device_placement – Yaroslav Bulatov Nov 04 '16 at 19:55
  • Thanks for the clarification. But these timeline traces still look strange to me. Why, if I assign all ops to gpu0, are they always computed sequentially (I tried matrix sizes from 5000 to 20000 and chain lengths up to 100)? It seems that these two chains could be computed in two parallel streams even on a single GPU. – MikleB Nov 05 '16 at 13:35
  • That's correct, TensorFlow doesn't schedule ops on parallel streams; instead, each op is able to use all streams of the GPU – Yaroslav Bulatov Nov 05 '16 at 19:15

Isn't it because you need to transfer data between the GPUs when you compute C? Can you try putting C on the CPU?

with tf.device('/cpu:0'):
  C = tf.matmul(B1, B2)
RaduK
  • It didn't help. Also, I think it shouldn't matter, since there are 500 matrix multiplications for each GPU before the last one. – MikleB Nov 04 '16 at 10:45