I want to train a model on multiple GPUs on a single node, using the following code:
import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    METRICS = [
        keras.metrics.TruePositives(name='tp'),
        keras.metrics.FalsePositives(name='fp'),
        keras.metrics.TrueNegatives(name='tn'),
        keras.metrics.FalseNegatives(name='fn'),
        keras.metrics.BinaryAccuracy(name='accuracy'),
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc'),
        keras.metrics.AUC(name='prc', curve='PR'),  # precision-recall curve
    ]
    residual_model = create_and_compile_model(shape, classes)

train_gen = DataGenerator(X_train_2, y_train_2, 32)
validation_gen = DataGenerator(X_validation_2, y_validation_2, 32)

residual_model_history = residual_model.fit(
    train_gen,
    validation_data=validation_gen,
    epochs=2,
    verbose=1,
    shuffle=True,  # note: ignored when x is a generator/Sequence
    class_weight=class_weight,
    callbacks=[early_stopping],
)
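For context, DataGenerator is a keras.utils.Sequence subclass. Simplified, it looks roughly like this (the real one does more preprocessing, so treat this as a sketch of its shape rather than my exact code):

import math
import numpy as np
from tensorflow import keras

class DataGenerator(keras.utils.Sequence):
    """Simplified sketch of my generator; the real one also preprocesses batches."""
    def __init__(self, x, y, batch_size):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Return one batch of (inputs, labels).
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return np.array(self.x[lo:hi]), np.array(self.y[lo:hi])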
However, it takes a very long time before training actually starts. This is the output of my code, after which the process sits for a long time:
Epoch 1/2
INFO:tensorflow:batch_all_reduce: 113 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 113 all-reduces with algorithm = nccl, num_packs = 1
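If I read the INFO lines right, MirroredStrategy is batching 113 NCCL all-reduces to synchronize gradients across the replicas. Is that all-reduce setup the likely bottleneck here? As an experiment I thought about overriding the cross-device ops when creating the strategy (a sketch; I haven't verified that this helps):

# Sketch: use hierarchical copy instead of the default NCCL all-reduce,
# to check whether the NCCL setup is what takes so long (assumption, not verified).
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=1)
)

Is this the right knob, or is the long initialization expected on the first epoch with MirroredStrategy?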