I'm working with TensorFlow and I want to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and one GPU simultaneously.
I tried creating two threads that feed two different TensorFlow sessions (one running on the CPU and the other on the GPU). Each thread feeds a fixed number of batches in a loop (e.g., out of 100 batches total, assign 20 to the CPU and 80 to the GPU, or any other split), and the results are combined afterwards. Ideally the split would be determined automatically.
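For concreteness, the split itself is plain list slicing (a sketch; batches is my list of input arrays):

num_batches_cpu = 20                      # e.g. 20 of 100 batches for the CPU
cpu_batches = batches[:num_batches_cpu]   # consumed by the CPU thread
gpu_batches = batches[num_batches_cpu:]   # consumed by the GPU thread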
However, even in this scenario the batches seem to be fed synchronously: even when I send only a few batches to the CPU and compute all the others on the GPU (so the GPU is the bottleneck), the overall prediction time is always higher than in a test run using the GPU alone.
I would expect it to be faster, because when only the GPU is working the CPU usage sits at about 20-30%, so there is spare CPU capacity available to speed up the computation.
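For reference, the comparison is a plain wall-clock measurement around the two configurations (a sketch; split_cpu_gpu is the function defined further down):

import time

start = time.time()
split_cpu_gpu(batches, 20, tensor_cpu, tensor_gpu)   # 20 batches on CPU, 80 on GPU
print('CPU+GPU: %.2f s' % (time.time() - start))

start = time.time()
split_cpu_gpu(batches, 0, tensor_cpu, tensor_gpu)    # everything on the GPU
print('GPU only: %.2f s' % (time.time() - start))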
I have read a lot of discussions, but they all deal with parallelism across multiple GPUs, not between a GPU and the CPU.
Here is a sample of the code I have written: the tensor_cpu and tensor_gpu objects are loaded from the same Keras model in this way:
import tensorflow as tf
from keras.models import load_model

# x is the shared input placeholder; the shape here is just an example
x = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])

with tf.device('/gpu:0'):
    model_gpu = load_model('model1.h5')
    tensor_gpu = model_gpu(x)
with tf.device('/cpu:0'):
    model_cpu = load_model('model1.h5')
    tensor_cpu = model_cpu(x)
Then the prediction is done as follows:
def predict_on_device(session, predict_tensor, batches):
    # run the model on this thread's share of the batches
    # (in my full code the outputs are collected and merged;
    # they are dropped here to keep the example short)
    for batch in batches:
        session.run(predict_tensor, feed_dict={x: batch})
from threading import Thread

def split_cpu_gpu(batches, num_batches_cpu, tensor_cpu, tensor_gpu):
    # one session per device; log_device_placement confirms where ops run
    session1 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session1.run(tf.global_variables_initializer())
    session2 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session2.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    # the first num_batches_cpu batches go to the CPU, the rest to the GPU
    t_cpu = Thread(target=predict_on_device,
                   args=(session1, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device,
                   args=(session2, tensor_gpu, batches[num_batches_cpu:]))

    t_cpu.start()
    t_gpu.start()

    coord.join([t_cpu, t_gpu])

    session1.close()
    session2.close()
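And the call site, for completeness (the input shapes here are made-up stand-ins matching the example placeholder above, not my real data):

import numpy as np

# 100 dummy batches of 32 samples each; my real data has the same layout
batches = [np.random.rand(32, 224, 224, 3).astype(np.float32)
           for _ in range(100)]

split_cpu_gpu(batches, 20, tensor_cpu, tensor_gpu)   # 20 on CPU, 80 on GPU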
How can I achieve this CPU/GPU parallelization? I think I'm missing something.
Any kind of help would be much appreciated!