
When using multiple GPUs to perform inference on a model (e.g. via the call method: model(inputs)) and to calculate its gradients, the machine only uses one GPU, leaving the rest idle.

For example, in the code snippet below:

import tensorflow as tf
import numpy as np
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Build the tf.data input pipeline
path_filename_records = 'your_path_to_records'
bs = 128

dataset = tf.data.TFRecordDataset(path_filename_records)
dataset = (dataset
           .map(parse_record, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(bs)
           .prefetch(tf.data.experimental.AUTOTUNE)
          )

# Load model trained using MirroredStrategy
path_to_resnet = 'your_path_to_resnet'
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    resnet50 = tf.keras.models.load_model(path_to_resnet)

for pre_images, true_label in dataset:
    with tf.GradientTape() as tape:
        tape.watch(pre_images)
        outputs = resnet50(pre_images)
    grads = tape.gradient(outputs, pre_images)
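
The parse_record function is not shown because it depends on how the TFRecords were written; a minimal sketch, assuming JPEG-encoded images stored under hypothetical 'image' / 'label' keys, could look like this:

def parse_record(serialized_example):
    # Hypothetical feature spec; adjust the keys, dtypes and shapes to match
    # how your TFRecords were actually written.
    features = tf.io.parse_single_example(
        serialized_example,
        {'image': tf.io.FixedLenFeature([], tf.string),
         'label': tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features['image'], channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, features['label']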

Only one GPU is used. You can profile the behavior of the GPUs with nvidia-smi. I don't know if it is supposed to be like this, i.e. whether both model(inputs) and tape.gradient lack multi-GPU support. But if it is, then it's a big problem, because if you have a large dataset and need to calculate the gradients with respect to the inputs (e.g. for interpretability purposes) it might take days with one GPU. Another thing I tried was model.predict(), but that isn't possible with tf.GradientTape.

What I've tried so far that didn't work

  1. Put all the code inside the mirrored strategy scope (see the sketch after this list).
  2. Used different GPUs: I've tried A100, A6000 and RTX5000. I also changed the number of graphics cards and varied the batch size.
  3. Specified a list of GPUs, for instance, strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1']).
  4. Added strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()) as @Kaveh suggested.
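
A sketch of attempt 1 (not the exact code I ran, shown only to illustrate what "all the code inside the scope" meant):

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    resnet50 = tf.keras.models.load_model(path_to_resnet)
    for pre_images, true_label in dataset:
        with tf.GradientTape() as tape:
            tape.watch(pre_images)
            outputs = resnet50(pre_images)
        grads = tape.gradient(outputs, pre_images)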

How do I know that only one GPU is working?

I used the command watch -n 1 nvidia-smi in the terminal and observed that only one GPU is at 100% utilization while the rest stay at 0%.
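
You can also confirm from within TensorFlow that both GPUs are visible (this only checks visibility, not actual utilization):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print(len(gpus), 'GPUs visible:', gpus)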

Working Example

You can find a working example with a CNN trained on the dogs_vs_cats dataset below. You won't need to download the dataset manually, as I used the tfds version, nor train a model yourself.

Notebook: Working Example.ipynb

Saved Model:

mCalado
  • you probably need to put all your code inside the mirrored strategy scope, right now only the model loading is inside the scope. – Dr. Snoopy Jul 07 '21 at 09:47
  • Hi @Dr.Snoopy, I did try that but the same behavior persisted. – mCalado Jul 07 '21 at 09:53
  • How do you determine that one GPU is being used? – Dr. Snoopy Jul 07 '21 at 09:56
  • os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1", I am using two A100s and I check nvidia-smi – mCalado Jul 07 '21 at 09:57
  • One is at 100% and the other is at 0% – mCalado Jul 07 '21 at 09:58
  • This `strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())` may resolve your issue. – Kaveh Jul 07 '21 at 12:28
  • Just tried it, no bueno :( Thanks anyway! – mCalado Jul 07 '21 at 12:52
  • Have you tried listing the GPUs in your definition? Like this: `strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])`. – Kaveh Jul 07 '21 at 13:09
  • Yes sir! I will update my question with all of those details that both of you pointed out. – mCalado Jul 07 '21 at 13:18
  • Could you provide a minimal reproducible example? Right now when trying to reproduce the behaviour, I have the 2 GPUs used. In any case you might want to look into: https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#run – Zaccharie Ramzi Jul 10 '21 at 14:32
  • Another problem might be that when specifying the visible devices you need to avoid commas. I haven't tested this because I don't have access right now to 2 physical GPUs, but it rings a bell. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars – Zaccharie Ramzi Jul 10 '21 at 14:33
  • Hi @ZaccharieRamzi. Thank you for your help! I will edit the question and add an example. Nevertheless, I have a couple of questions regarding your statement. 1. Can you actually give me a working example of this working? Does it actually compute the gradients when you use the run method? Can you print the result and check that the gradients are not None? 2. I don't understand your last comment. What do you mean by avoiding commas? Do you mean spaces between the numbers and the commas? Can't really find that in the link that you provided. Cheers. – mCalado Jul 10 '21 at 18:17
  • Re #2 yes sorry I meant avoid spaces. Re #1 I will come back to you shortly with a colab illustrating what I tried, but working on your minimal example might be best – Zaccharie Ramzi Jul 10 '21 at 20:17
  • Thank you. I agree, working on a minimal example is best. I will provide it ASAP. Thank you for the suggestion! – mCalado Jul 10 '21 at 20:28
  • You can find the colab illustrating what I tried here: https://colab.research.google.com/drive/1jld4hhq0VNiWyzk1UPjLhp6kA5oPrMA0?usp=sharing I use Weights and Biases to monitor the GPU usage as per [this answer](https://stackoverflow.com/a/62654077/4332585). Because I am on Colab, I have only one physical GPU and I therefore created 2 logical GPUs. Therefore when monitoring, a use of more than 50% (compute or memory) indicates that the distribution worked. The `grads` are not `None`. – Zaccharie Ramzi Jul 10 '21 at 21:25
  • Hi @ZaccharieRamzi, sorry for the late reply. I've added a working example in the description. Also, I tried what you suggested. I don't think the example that you provide is too farfetched from the one that I gave in the code snippet. The problem remains the same :/ – mCalado Jul 13 '21 at 12:11
  • First, try to print the number of GPUs: `strategy = tf.distribute.MirroredStrategy()` `print('Number of devices: {}'.format(strategy.num_replicas_in_sync))` – aravinda_gn Jul 16 '21 at 09:59
  • Try setting up these two: `os.environ["NVIDIA_VISIBLE_DEVICES"] = "0,1"` `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"` – aravinda_gn Jul 16 '21 at 10:10
  • Hi @Aravinda_gn, thank you very much for your help! Regarding the first comment, I did that in the working example. That prints 2. About the last statement, just tried it and the issue still remains :/ – mCalado Jul 16 '21 at 11:49
  • Did u try updating GPU driver, CUDA, TensorFlow version? @mCalado – aravinda_gn Jul 19 '21 at 04:09

1 Answer


Any code that is outside of mirrored_strategy.run() is supposed to run on a single GPU (probably the first GPU, GPU:0). Also, since you want the gradients returned from the replicas, mirrored_strategy.gather() is needed as well.

Besides these, a distributed dataset must be created with mirrored_strategy.experimental_distribute_dataset. The distributed dataset tries to split each batch of data evenly across the replicas. An example covering these points is included below.

model.fit(), model.predict(), etc. run in a distributed manner automatically simply because they already handle everything mentioned above for you.

Example code:

import tensorflow as tf
import numpy as np

mirrored_strategy = tf.distribute.MirroredStrategy()
print(f'using distribution strategy\nnumber of gpus: {mirrored_strategy.num_replicas_in_sync}')

dataset = tf.data.Dataset.from_tensor_slices(np.random.rand(64, 224, 224, 3)).batch(8)

#create distributed dataset
ds = mirrored_strategy.experimental_distribute_dataset(dataset)

#make variables mirrored
with mirrored_strategy.scope():
  resnet50=tf.keras.applications.resnet50.ResNet50()

def step_fn(pre_images):
  with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(pre_images)
    outputs = resnet50(pre_images)[:, 0:1]
  return tf.squeeze(tape.batch_jacobian(outputs, pre_images))

#define distributed step function using strategy.run and strategy.gather
@tf.function
def distributed_step_fn(pre_images):
  per_replica_grads = mirrored_strategy.run(step_fn, args=(pre_images,))
  return mirrored_strategy.gather(per_replica_grads,0)

#loop over distributed dataset with distributed_step_fn
for result in map(distributed_step_fn,ds):
  print(result.numpy().shape)
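
With 2 replicas, each replica should receive half of each batch of 8 images and compute its Jacobian locally, so the gathered result per batch should come back with shape (8, 224, 224, 3).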
Laplace Ricky
  • Hi @Laplace Ricky, thanks a lot for your help. I just tried it and this seems to work! Congratulations! I have had this problem for a very long time and can't thank you enough! – mCalado Jul 16 '21 at 12:30
  • I have some questions about your answer: 1. What does ``tf.squeeze(tape.batch_jacobian(outputs, pre_images))`` do and why didn't you use the tape.gradients? I tried with the ``tape.gradient`` and it works! 2. What is the second parameter in ``mirrored_strategy.gather(per_replica_grads,0)`` and why zero? – mCalado Jul 16 '21 at 12:32
  • Since the bounty is expiring and the solution works, I am giving you the reward. Nevertheless, I would like to work on the answer a little bit more. – mCalado Jul 16 '21 at 13:14
  • Replies to your questions, 1. I am just giving an arbitrary example here as I don't know what gradients you want. My example computes the gradients of the first output (among 1000 outputs from resnet50) with respect to inputs. Regarding `tape.gradient`, we usually pass a scalar to its first argument for the sake of clarity, but if you think that it's giving the gradients you want, just go ahead. 2. as the distributed dataset divides the input data to replicas along the first dimension, `mirrored_strategy.gather` joins the output data back along the first dimension. – Laplace Ricky Jul 16 '21 at 17:32
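
Building on that reply, a variant of step_fn that reduces the first logit to a scalar so tape.gradient can be used directly might look like this (a sketch, not part of the original answer; it assumes the same resnet50 and distribution setup as above):

def step_fn(pre_images):
  with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(pre_images)
    # Sum the first logit over the batch; per-image gradients are unaffected
    # because the examples are independent of each other.
    target = tf.reduce_sum(resnet50(pre_images)[:, 0])
  return tape.gradient(target, pre_images)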