
I have created 3 virtual GPUs (I have 1 physical GPU) and am trying to speed up vectorization of images. However, using the code provided below with manual device placement from the official docs (here), I got strange results: running on all GPUs is two times slower than on a single one. I also checked this code (with the virtual device initialization removed) on a machine with 3 physical GPUs; it behaves the same way.

Environment: Python 3.6, Ubuntu 18.04.3, tensorflow-gpu 1.14.0.

Code (this example creates 3 virtual devices, so you can test it on a PC with a single GPU):

import os
import time
import numpy as np
import tensorflow as tf
from PIL import Image

start = time.time()

def load_graph(frozen_graph_filename):
    # We load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile(frozen_graph_filename, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    # Then, we import the graph_def into a new Graph and returns it
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/nodes in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="")
    return graph

path_to_graph = '/imagenet/'  # Path to imagenet folder where graph file is placed
GRAPH = load_graph(os.path.join(path_to_graph, 'classify_image_graph_def.pb'))

# Create Session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
session = tf.Session(graph=GRAPH, config=config)

output_dir = '/vectors/'  # where to save vectors from images
image_list = ['1.jpg', '2.jpg', '3.jpg']  # list of images to vectorize (tested with 100 and 1000 examples)
selected_list = image_list

# Single GPU vectorization
for image_index, image in enumerate(selected_list):
    with Image.open(image) as f:
        image_data = f.convert('RGB')
        feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
        feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
        feature_vector = np.squeeze(feature_vector)
        outfile_name = os.path.basename(image) + ".vc"
        out_path = os.path.join(output_dir, outfile_name)
        # Save vector
        np.savetxt(out_path, feature_vector, delimiter=',')

print(f"Single GPU: {time.time() - start}")
start = time.time()

print("Start calculation on multiple GPU")
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Create 3 virtual GPUs with 1GB memory each
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

print("Create prepared ops")
start1 = time.time()
gpus = logical_gpus  # comment this line to use physical GPU devices for calculations

# Assign a chunk of the image list to each GPU
# chunk = len(image_list) // 3
# image_list1, image_list2, image_list3 = image_list[:chunk], \
#                                         image_list[chunk:2 * chunk], \
#                                         image_list[2 * chunk:]
selected_list = image_list  # comment this line out if you want to assign a chunk of the list to each GPU manually
output_vectors = []
if gpus:
  # Replicate your computation on multiple GPUs
  feature_vectors = []
  for gpu in gpus:  # iterating over virtual GPU devices, not physical ones
    with tf.device(gpu.name):
      print(f"Assign list of images to {gpu.name.split(':', 4)[-1]}")
      # Try to assign a chunk of the image list to each GPU - takes the same time as a single GPU
      # if gpu.name.split(':', 4)[-1] == "GPU:0":
      #     selected_list = image_list1
      # if gpu.name.split(':', 4)[-1] == "GPU:1":
      #     selected_list = image_list2
      # if gpu.name.split(':', 4)[-1] == "GPU:2":
      #     selected_list = image_list3
      for image_index, image in enumerate(selected_list):
          with Image.open(image) as f:
            image_data = f.convert('RGB')
            feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
            feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
            feature_vectors.append(feature_vector)

print("All images has been assigned to GPU's")
print(f"Time spend on prep ops: {time.time() - start1}")
print("Start calculation on multiple GPU")
start1 = time.time()
for image_index, image in enumerate(image_list):
  feature_vector = np.squeeze(feature_vectors[image_index])
  outfile_name = os.path.basename(image) + ".vc"
  out_path = os.path.join(output_dir, outfile_name)
  # Save vector
  np.savetxt(out_path, feature_vector, delimiter=',')

# Close session
session.close()
print(f"Calc on GPU's spend: {time.time() - start1}")
print(f"All time, spend on multiple GPU: {time.time() - start}")

Example output (from a list of 100 images):

1 Physical GPU, 3 Logical GPUs
Single GPU: 18.76301646232605
Start calculation on multiple GPUs
Create prepared ops
Assign list of images to GPU:0
Assign list of images to GPU:1
Assign list of images to GPU:2
All images have been assigned to GPUs
Time spent on prep ops: 18.263537883758545
Start calculation on multiple GPUs
Calc on GPUs took: 11.697082042694092
Total time spent on multiple GPUs: 29.960679531097412

What I tried: splitting the list of images into 3 chunks and assigning each chunk to a GPU (see the commented-out lines of code). This reduces the multi-GPU time to 17 seconds, which is a little faster than the single-GPU run of 18 seconds (~5%).

Expected results: the multi-GPU version is faster than the single-GPU version (at least a 1.5x speedup).

Ideas about why this may happen: I wrote the computation the wrong way.

Dmitriy Kisil
  • *Expected results: MultiGPU version is faster than singleGPU version (at least 1.5x speedup).* - that expectation might not have strong roots in reality. Have you checked GPU utilization? (https://askubuntu.com/questions/387594/how-to-measure-gpu-usage may help). If the single-GPU setting shows your GPU is fully or nearly fully utilized, splitting it into multiple virtual devices will perform worse for sure as context-switching is an operation which takes time. – tevemadar Jan 03 '20 at 10:13

1 Answer


There are two basic misunderstandings that are causing your trouble:

  1. with tf.device(...): applies to the graph nodes created within its scope, not to Session.run calls (illustrated in the sketch after this list).

  2. Session.run is a blocking call; separate calls don't run in parallel. TensorFlow can only parallelize the contents of a single Session.run.
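
A minimal sketch of the first point (my illustration, assuming soft placement so it also runs on a CPU-only machine): the device scope only matters while ops are being created, and wrapping Session.run in it changes nothing:

import tensorflow as tf

g = tf.Graph()
with g.as_default():
  with tf.device('/GPU:0'):
    a = tf.constant([1.0, 2.0])  # this op is pinned to GPU:0 at creation time
  b = a * 2                      # created outside the device scope

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(graph=g, config=config) as sess:
  with tf.device('/GPU:1'):  # has no effect: no ops are created here
    print(sess.run(b))       # placement was already decided at graph construction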

Modern TF (>= 2.0) can make this much easier.

Mainly, you can stop using tf.Session and tf.Graph. Use @tf.function instead; I believe this basic structure will work:

@tf.function
def my_function(inputs, gpus, model):
  results = []
  for input, gpu in zip(inputs, gpus):
    with tf.device(gpu):
      results.append(model(input))    
  return results
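
A hypothetical way to call it, assuming the virtual devices from the question have been configured and using a Keras model as a stand-in for the frozen Inception graph (the batch_* names are made up for illustration):

logical_gpus = [d.name for d in tf.config.experimental.list_logical_devices('GPU')]
model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')
batches = [batch_0, batch_1, batch_2]  # hypothetical: one batch of images per logical GPU
vectors = my_function(batches, logical_gpus, model)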

But you will want to try a more realistic test. With just 3 images you're not at all measuring real performance.

Also note:

  1. The tf.distribute.Strategy class may help simplify some of this, by separating the device specification from the @tf.function that's being run: strategy.experimental_run_v2(my_function, args=(dataset_inputs,))
  2. Using tf.data.Dataset input pipelines will help you overlap loading/preprocessing with model execution (rough sketches of both points follow this list).
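
For point 2, a sketch of such an input pipeline, assuming TF 2.x and a list image_paths of JPEG files (the 299x299 resize matches Inception v3; the preprocessing details are placeholders):

def load_image(path):
  data = tf.io.read_file(path)
  image = tf.image.decode_jpeg(data, channels=3)
  return tf.image.resize(image, [299, 299])

dataset = (tf.data.Dataset.from_tensor_slices(image_paths)
           .map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))  # overlaps loading with model execution

And for point 1, a sketch with tf.distribute.MirroredStrategy consuming that dataset, again with a Keras model standing in for the frozen graph:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')

@tf.function
def extract(batch):
  return model(batch)

dist_dataset = strategy.experimental_distribute_dataset(dataset)
for batch in dist_dataset:
  per_replica_vectors = strategy.experimental_run_v2(extract, args=(batch,))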

But if you're really intent on doing this using tf.Graph and tf.Session, I think you basically need to reorganize your code from this:

# Your code:
# Builds a graph
graph = build_graph()

for gpu in gpus:
  with tf.device(gpu):
    # Calls session.run once per device scope
    session.run(...)

To this:

g = tf.Graph()
with g.as_default():
  results = []
  for gpu in gpus:
    # Build the graph on each device
    input = iterator.get_next()
    with tf.device(gpu):
      results.append(my_function(input))

# Use a single `Session.run` call
np_result = session.run(results, feed_dict={inputs: my_inputs})
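
Applied to the question's setup, a rough sketch might look like the following. It assumes graph_def has been parsed from classify_image_graph_def.pb as in the question; the tower_* prefixes and the per-tower image_data_* variables are made up for illustration (the frozen Inception graph's DecodeJpeg:0 input takes one image at a time, so each tower gets one image per run call):

g = tf.Graph()
feature_tensors = []
input_names = []
with g.as_default():
  for i, gpu in enumerate(['/GPU:0', '/GPU:1', '/GPU:2']):
    with tf.device(gpu):
      # Import one copy of the frozen graph per device, each under its own prefix
      tf.import_graph_def(graph_def, name=f'tower_{i}')
    feature_tensors.append(g.get_tensor_by_name(f'tower_{i}/pool_3:0'))
    input_names.append(f'tower_{i}/DecodeJpeg:0')

with tf.Session(graph=g) as session:
  # A single blocking call; TensorFlow can run the three towers in parallel
  vectors = session.run(feature_tensors,
                        feed_dict={input_names[0]: image_data_0,
                                   input_names[1]: image_data_1,
                                   input_names[2]: image_data_2})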
mdaoust
  • Can you help me with how to send multiple `image_data` to `feed_dict`? A workable example on one image: `two_vectors = session.run([feature_tensor, softmax_tensor], feed_dict={'DecodeJpeg:0': image_data})`. But when I tried to send them as `feed_dict={'DecodeJpeg:0': [image_data1, image_data2]}`, it did not work (an error saying a string is required, not an int). Also tried `feed_dict={'DecodeJpeg:0': image_data1, 'x': image_data2}` - an error saying x is not known in the graph. Tried to add a `tf.placeholder` - that did not help either. – Dmitriy Kisil Jan 10 '20 at 08:36
  • Don't use feed dicts. Or: `feed_dict = {tower_1:image_batch_1, tower_2:image_batch_2, tower_3:image_batch_3}` but loading won't run in parallel with execution. – mdaoust Jan 10 '20 at 15:01