I am trying to train a modified UNET model, a lighter version of the usual image-segmentation UNET, at my office. I use the Keras library and searched for ways to utilize the company's 2 GPUs and use GPU memory efficiently.
I found 3 ways to do so, but I still run into GPU out-of-memory errors, and I could see very unbalanced GPU memory usage across the two devices while training: the first one was at 99% while the other was at almost 20%.
I want to know why this happens and how I can fix it.
Here is what I've tried on my model.
- Set memory growth on the available physical devices (the GPUs)
import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')
print('the # of available physical GPU devices is', len(physical_devices))
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)
The code block above is what I run before training starts, and I checked that 2 GPUs are available.
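For completeness, this is a minimal sketch of how I double-check that TensorFlow actually registers both GPUs after enabling memory growth (tf.config.list_logical_devices is the standard call; the printed names are just whatever TensorFlow assigns, e.g. '/device:GPU:0'):

# Sketch: logical devices are created once the runtime initializes,
# so this should report 2 GPUs in my setup.
import tensorflow as tf

logical_gpus = tf.config.list_logical_devices('GPU')
print('logical GPUs visible to TensorFlow:', [gpu.name for gpu in logical_gpus])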
- Mirrored Strategy
This is one of the tools TensorFlow offers to utilize multiple GPUs at once. I read its docs and found out how to use it.
def MakeUnet():
    # I defined the modified UNET here
    # ... (model layers omitted)
    model = Model(inputs=[inputs], outputs=[outputs])
    # model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  metrics=['accuracy'])
    # model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=1e-4), metrics=['accuracy'])
    model.summary()
    return model

# declare MirroredStrategy
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = MakeUnet()
print('Fitting model...')
early_stopping = callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0.0009,
                                         patience=32, verbose=1, mode='auto')
reduce_lr = callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1,
                                        patience=20, min_lr=1e-8)
checkpoint = callbacks.ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1,
                                       save_best_only=True, mode='max')
callbacks_list = [checkpoint, early_stopping]
history = model.fit(train_img, label_img, batch_size=BATCH, epochs=EPOCHS,
                    verbose=1, validation_split=0.05, shuffle=True,
                    callbacks=callbacks_list)
According to the TensorFlow MirroredStrategy docs, anything that creates variables should be inside the 'with mirrored_strategy.scope():' block, so I define the model and call the function inside the scope. And even if I missed something that was supposed to be in the scope, the docs say it will be recognized automatically and run within the scope.
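If I understand the docs correctly, the batch size passed to fit is the global batch, which MirroredStrategy splits across the replicas, so each GPU only sees BATCH / num_replicas samples per step. A minimal sketch of how I think the scaling works (PER_GPU_BATCH is a hypothetical name I made up for illustration):

# Sketch: with 2 GPUs, a global batch of 16 puts 8 samples on each GPU.
# Scaling by num_replicas_in_sync keeps the per-GPU load constant.
PER_GPU_BATCH = 8  # hypothetical per-GPU batch size
BATCH = PER_GPU_BATCH * mirrored_strategy.num_replicas_in_sync  # global batch
print('replicas in sync:', mirrored_strategy.num_replicas_in_sync)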
- Mini batch
I was using the Keras fit function, so I set a small batch size to keep the GPUs from running out of memory no matter how much training data I have. But I always hit a GPU out-of-memory error whenever I try to train on more than around 6,000 samples, even with a small batch size (one idea I am considering is sketched below).
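The idea, sketched under the assumption that train_img and label_img are plain NumPy arrays: feed fit a tf.data pipeline that streams batches instead of the full arrays, so the whole dataset never has to sit on a GPU at once. The shuffle buffer size is a guess, and validation_split doesn't work with a Dataset, so validation data would need to be split off separately.

# Sketch: stream batches with tf.data instead of passing the full
# arrays to model.fit. Only one batch (plus prefetched ones) is
# resident on a GPU at a time.
dataset = tf.data.Dataset.from_tensor_slices((train_img, label_img))
dataset = (dataset
           .shuffle(buffer_size=1024)   # assumed buffer size
           .batch(BATCH)
           .prefetch(tf.data.AUTOTUNE))

history = model.fit(dataset, epochs=EPOCHS, verbose=1, callbacks=callbacks_list)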
I am really confused. I think the 1st approach is working well, but I don't think the 2nd and 3rd are working properly, because the two GPUs' memory usage was really unbalanced and I keep hitting GPU out-of-memory errors. Please help me.