
I have a machine with 3x GTX 1080 GPUs. Below is the training code:

    dynamic_learning_rate = tf.placeholder(tf.float32, shape=[])
    model_version = tf.constant(1, tf.int32)

    with tf.device('/cpu:0'):
        with tf.name_scope('Input'):
            # Input images and labels.
            batch_images,\
                batch_input_vectors,\
                batch_one_hot_labels,\
                batch_file_paths,\
                batch_labels = self.get_batch()

    grads = []
    pred = []
    cost = []

    # Define optimizer
    optimizer = tf.train.MomentumOptimizer(learning_rate=dynamic_learning_rate / self.batch_size,
                                           momentum=0.9,
                                           use_nesterov=True)

    split_input_image = tf.split(batch_images, self.num_gpus)
    split_input_vector = tf.split(batch_input_vectors, self.num_gpus)
    split_input_one_hot_label = tf.split(batch_one_hot_labels, self.num_gpus)
    for i in range(self.num_gpus):
        with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
            with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
                with tf.name_scope('Model'):
                    # Construct model
                    with tf.variable_scope("inference"):
                        tower_pred = self.model(split_input_image[i], split_input_vector[i], is_training=True)

                    pred.append(tower_pred)

                with tf.name_scope('Loss'):
                    # Define loss and optimizer
                    softmax_cross_entropy_cost = tf.reduce_mean(
                        tf.nn.softmax_cross_entropy_with_logits(logits=tower_pred, labels=split_input_one_hot_label[i]))

                    cost.append(softmax_cross_entropy_cost)

    # Concat variables
    pred = tf.concat(pred, 0)
    cost = tf.reduce_mean(cost)

    # L2 regularization
    trainable_vars = tf.trainable_variables()
    l2_regularization = tf.add_n(
        [tf.nn.l2_loss(v) for v in trainable_vars if any(x in v.name for x in ['weights', 'biases'])])
    for v in trainable_vars:
        if any(x in v.name for x in ['weights', 'biases']):
            print(v.name + ' - included for L2 regularization!')
        else:
            print(v.name)

    cost = cost + self.l2_regularization_strength*l2_regularization

    with tf.name_scope('Accuracy'):
        # Evaluate model
        correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(batch_one_hot_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        prediction = tf.nn.softmax(pred, name='softmax')

    # Creates a variable to hold the global_step.
    global_step = tf.Variable(0, trainable=False, name='global_step')

    # Minimization
    update = optimizer.minimize(cost, global_step=global_step, colocate_gradients_with_ops=True)

After I run the training, nvidia-smi shows the following:

Fri Nov 10 12:28:00 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   65C    P2    62W / 198W |   7993MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 33%   53C    P2   150W / 198W |   7886MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:05:00.0  On |                  N/A |
| 26%   54C    P2   170W / 198W |   7883MiB /  8108MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23228      C   python                                      7982MiB |
|    1     23228      C   python                                      7875MiB |
|    2      4793      G   /usr/lib/xorg/Xorg                            40MiB |
|    2     23228      C   python                                      7831MiB |
+-----------------------------------------------------------------------------+

Fri Nov 10 12:28:36 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   59C    P2    54W / 198W |   7993MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 33%   57C    P2   154W / 198W |   7886MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:05:00.0  On |                  N/A |
| 27%   55C    P2   155W / 198W |   7883MiB /  8108MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23228      C   python                                      7982MiB |
|    1     23228      C   python                                      7875MiB |
|    2      4793      G   /usr/lib/xorg/Xorg                            40MiB |
|    2     23228      C   python                                      7831MiB |
+-----------------------------------------------------------------------------+

You can see that whenever the first GPU is busy, the other two GPUs are idle, and vice versa; they alternate roughly every 0.5 seconds.

With a single GPU the training speed is around 650 images/second;
with all three GPUs I only get about 1050 images/second.

Any idea what the problem is?

  • Sounds like one GPU is acting as the parameter server and the other two are doing the training – Robert Crovella Nov 10 '17 at 05:12
  • So I need to make sure the parameter server runs on the CPU? Or do I need to distribute the parameter server across all the GPUs? BTW, I don't know how to configure the parameter server in TensorFlow. – Max Shek-wai Chu Nov 10 '17 at 05:57

1 Answer


You need to make sure that all the trainable variables live on the controller device (usually the CPU) and that all the worker devices (usually the GPUs) read those shared variables from the CPU while computing their towers in parallel. In your code the variables are created inside the per-GPU device scopes, so they all end up on GPU 0, that GPU effectively acts as the parameter server, and the other GPUs sit idle while waiting for it.
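
One common way to do this in TF 1.x is to pass a device function to tf.device() so that variable-holding ops are pinned to /cpu:0 while everything else stays on the tower's GPU. Below is a minimal sketch reusing the names from your snippet (self.num_gpus, self.model, the split_* tensors, pred and cost); the assign_to_device helper and the PS_OPS list are illustrative, not part of the TensorFlow API:

    import tensorflow as tf

    # Op types that hold variable state; these should live on the parameter
    # device (the CPU) so that every tower shares one copy of the weights.
    PS_OPS = ('Variable', 'VariableV2', 'VarHandleOp', 'AutoReloadVariable')

    def assign_to_device(worker_device, ps_device='/cpu:0'):
        """Device function: variables land on ps_device, all other ops on worker_device."""
        def _assign(op):
            node_def = op if isinstance(op, tf.NodeDef) else op.node_def
            if node_def.op in PS_OPS:
                return ps_device
            return worker_device
        return _assign

    # Tower loop from the question, with the device function applied:
    for i in range(self.num_gpus):
        with tf.device(assign_to_device('/gpu:%d' % i, ps_device='/cpu:0')):
            with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
                with tf.variable_scope("inference"):
                    tower_pred = self.model(split_input_image[i],
                                            split_input_vector[i],
                                            is_training=True)
                pred.append(tower_pred)
                cost.append(tf.reduce_mean(
                    tf.nn.softmax_cross_entropy_with_logits(
                        logits=tower_pred,
                        labels=split_input_one_hot_label[i])))

You can verify the placement by creating the session with tf.ConfigProto(log_device_placement=True): every Variable should be reported on /cpu:0 and the heavy ops of tower i on /gpu:i. Once the weights live on the CPU, the towers no longer queue up behind GPU 0, which is the likely cause of the alternating 100%/0% utilization in your nvidia-smi output.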

Ido Loebl