
I am trying to train a modified U-Net model (a lighter version for image segmentation) at my office. I use the Keras library, and I looked for ways to make use of the company's 2 GPUs and use GPU memory efficiently.

I found 3 ways to do so, but I still run into GPU out-of-memory errors, and memory usage across the two devices is very unbalanced during training: one GPU is about 99% used while the other sits at roughly 20%.

I want to know why this happens and how I can fix it.

This is what I've tried on my model.

  1. Enable memory growth on the available physical devices (the GPUs)
import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')
print('the # of available physical GPU devices is', len(physical_devices))
for device in physical_devices:
    # allocate GPU memory on demand instead of reserving it all up front
    tf.config.experimental.set_memory_growth(device, True)

The code block above is what I run before training starts, and it confirmed that 2 GPUs are available.
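While debugging, I also looked at capping each GPU with a fixed memory limit instead of memory growth. This is only a sketch I have not actually adopted, and the 4096 MB figure is just a placeholder value:

# Alternative sketch: give each physical GPU a hard memory cap (value is a placeholder)
for device in tf.config.list_physical_devices('GPU'):
    tf.config.set_logical_device_configuration(
        device,
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])  # in MB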

  2. MirroredStrategy. This is one of the tools that TensorFlow offers to utilize multiple GPUs at once. I read its docs and found out how to use it.
from tensorflow import keras
from tensorflow.keras.models import Model

def MakeUnet(self):
    # I define the modified U-Net layers here (details omitted)
    model = Model(inputs=[inputs], outputs=[outputs])
    # model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=1e-4), metrics=['accuracy'])
    # model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=1e-4), metrics=['accuracy'])
    model.summary()
    return model

from tensorflow.keras import callbacks
from tensorflow.keras.callbacks import ModelCheckpoint

# declare Mirrored Strategy (these lines run inside my training method)
mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
    model = self.MakeUnet()
    print('Fitting model...')
    early_stopping = callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0.0009, patience=32, verbose=1, mode='auto')
    reduce_lr = callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=20, min_lr=1e-8)
    checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
    callbacks_list = [checkpoint, early_stopping]

history = model.fit(train_img, label_img, batch_size=BATCH, epochs=EPOCHS, verbose=1, validation_split=0.05, shuffle=True, callbacks=callbacks_list)

Since the TensorFlow MirroredStrategy docs say that anything which creates variables should be inside the `with mirrored_strategy.scope():` block, I define and build the model inside the scope. Even if I missed something that was supposed to be in the scope, the docs say it will be recognized automatically and run within the scope.
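For comparison, this is a minimal self-contained sketch of the pattern I believe the docs describe, with an explicit global batch size that MirroredStrategy splits across replicas. The tiny stand-in model, the 128x128 shapes, and the per-replica batch of 4 are assumptions for illustration, not my real settings:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('number of replicas:', strategy.num_replicas_in_sync)

# the global batch is what fit() sees; MirroredStrategy splits it across GPUs
per_replica_batch = 4  # assumed value
global_batch = per_replica_batch * strategy.num_replicas_in_sync

with strategy.scope():
    # tiny stand-in model; my real model is the modified U-Net above
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, padding='same', activation='relu', input_shape=(128, 128, 1)),
        tf.keras.layers.Conv2D(1, 1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# dummy data just to make the sketch self-contained
x = np.random.rand(32, 128, 128, 1).astype('float32')
y = (np.random.rand(32, 128, 128, 1) > 0.5).astype('float32')

model.fit(x, y, batch_size=global_batch, epochs=1, verbose=1)

My understanding is that with this pattern, fit() shards each global batch of 8 into two per-replica batches of 4, one per GPU.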

  3. Mini batches

I was using the Keras fit function, so I set a small batch size to keep the GPUs from running out of memory regardless of how much training data I have. But I always hit a GPU memory shortage error whenever I try to train on more than around 6000 samples, even with a small batch size.
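Based on my reading, one thing I am considering is streaming the data with tf.data instead of passing the full NumPy arrays to fit(). This is only a sketch; the zero-filled loader, the 128x128 shapes, and the batch size of 8 are placeholders for my real file-reading code:

import numpy as np
import tensorflow as tf

def sample_generator():
    # placeholder loader: in the real code each image/mask pair would be read from disk on demand
    for _ in range(6000):
        img = np.zeros((128, 128, 1), dtype='float32')
        mask = np.zeros((128, 128, 1), dtype='float32')
        yield img, mask

dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_signature=(
        tf.TensorSpec(shape=(128, 128, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(128, 128, 1), dtype=tf.float32)))

# batch and prefetch so only a few batches sit in memory at a time
dataset = dataset.shuffle(256).batch(8).prefetch(tf.data.AUTOTUNE)  # 8 is an assumed batch size

# history = model.fit(dataset, epochs=EPOCHS, callbacks=callbacks_list)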

I am really confused. I think the 1st approach is working well, but I don't think the 2nd and 3rd are working properly, because the memory usage of the two GPUs is very unbalanced and I keep hitting GPU memory shortage errors. Please help.

  • What GPUs do you have, and can you post the actual error message? – Dr. Snoopy Aug 09 '23 at 09:09
  • It seems to me that the issue is not about the GPU, but more about the code used to train the model. Since you said that OOM was raised after around 6000 samples, some accumulation of data in RAM is probably happening, for example after each epoch. So what are you using to load the data, a generator or a TF dataset? – user2586955 Aug 11 '23 at 14:12

0 Answers