I want to train a model on multiple GPUs on a single node, using the following code:
import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    METRICS = [
        keras.metrics.TruePositives(name='tp'),
        keras.metrics.FalsePositives(name='fp'),
        keras.metrics.TrueNegatives(name='tn'),
        keras.metrics.FalseNegatives(name='fn'),
        keras.metrics.BinaryAccuracy(name='accuracy'),
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc'),
        keras.metrics.AUC(name='prc', curve='PR'),  # precision-recall curve
    ]
    residual_model = create_and_compile_model(shape, classes)

train_gen = DataGenerator(X_train_2, y_train_2, 32)
validation_gen = DataGenerator(X_validation_2, y_validation_2, 32)

residual_model_history = residual_model.fit(
    train_gen,
    validation_data=validation_gen,
    epochs=2,
    verbose=1,
    shuffle=True,  # note: ignored when x is a generator/Sequence
    class_weight=class_weight,
    callbacks=[early_stopping],
)
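For context, DataGenerator is a keras.utils.Sequence subclass. Simplified, it looks roughly like this (the real one does more preprocessing, so treat this as a sketch of its shape rather than my exact code):

import math
import numpy as np
from tensorflow import keras

class DataGenerator(keras.utils.Sequence):
    """Simplified sketch of my generator; the real one also preprocesses batches."""
    def __init__(self, x, y, batch_size):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Return one batch of (inputs, labels).
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return np.array(self.x[lo:hi]), np.array(self.y[lo:hi])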
However, it takes a very long time before training actually starts. This is the output of my code, after which the process sits for a long time:
Epoch 1/2
INFO:tensorflow:batch_all_reduce: 113 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 113 all-reduces with algorithm = nccl, num_packs = 1
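If I read the INFO lines right, MirroredStrategy is batching 113 NCCL all-reduces to synchronize gradients across the replicas. Is that all-reduce setup the likely bottleneck here? As an experiment I thought about overriding the cross-device ops when creating the strategy (a sketch; I haven't verified that this helps):

# Sketch: use hierarchical copy instead of the default NCCL all-reduce,
# to check whether the NCCL setup is what takes so long (assumption, not verified).
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=1)
)

Is this the right knob, or is the long initialization expected on the first epoch with MirroredStrategy?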