I have a script that performs 5-fold cross-validation on an image set with the pretrained ResNet50 model using Ktrain, which is a wrapper around TensorFlow Keras. A model is trained for 30 epochs on each fold, and the whole CV procedure is then repeated 5 additional times.
The first fold of the first run trains fast enough for my purposes, about 6 minutes per epoch:
Epoch 1/10
288/288 [==============================] - 370s 1s/step - loss: 13.0123 - mse: 13.0123 - val_loss: 2.8116 - val_mse: 2.8116
Epoch 2/10
288/288 [==============================] - 367s 1s/step - loss: 6.2146 - mse: 6.2146 - val_loss: 2.2179 - val_mse: 2.2179
However, the per-epoch training time for every subsequent model is substantially higher, well over 30 minutes per epoch:
begin training using onecycle policy with max lr of 0.0001...
Epoch 1/10
286/286 [==============================] - 2229s 8s/step - loss: 8.5098 - mse: 8.5098 - val_loss: 2.3128 - val_mse: 2.3128
Epoch 2/10
286/286 [==============================] - 2213s 8s/step - loss: 5.2229 - mse: 5.2229 - val_loss: 2.4311 - val_mse: 2.4311
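For reference, the per-epoch times above come straight from the Keras progress bar. The raw wall-clock time can also be checked with a small callback like the sketch below (my own helper, not part of Ktrain), passed through the callbacks argument of fit_onecycle:

import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Print the wall-clock duration of every epoch."""
    def on_epoch_begin(self, epoch, logs=None):
        self._t0 = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch + 1}: {time.time() - self._t0:.1f}s")

# usage (hypothetical): learner.fit_onecycle(1e-4, epochs, callbacks=[EpochTimer()])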
At the end of each fold I use the Ktrain function release_gpu_memory, defined as:
def release_gpu_memory(device=0):
    """
    Release GPU memory allocated by TensorFlow.
    Source:
    https://stackoverflow.com/questions/51005147/keras-release-memory-after-finish-training-process
    """
    from numba import cuda
    K.clear_session()
    cuda.select_device(device)
    cuda.close()
    return
A common solution I have seen is to call the Keras clear_session() function, which this function already includes. However, it does not appear to help. What can I do to keep the training time consistent across each iteration? Below is my script:
import os
import ktrain
from ktrain import vision as vis
from ktrain.vision.data import images_from_df
from ktrain.core import release_gpu_memory
import pandas as pd
from pprint import pprint
import glob2
from sklearn.model_selection import (GroupShuffleSplit, StratifiedGroupKFold)
from sklearn.model_selection import GroupKFold
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import numpy as np
import tensorflow as tf
import multiprocessing
#############################
# Initial settings excluded #
#############################
# Initialize results df
results = labels.copy()
for r in range(runs):
    # Generate splits
    gkf = GroupKFold(n_splits=5)
    for i, (train_index, val_index) in enumerate(gkf.split(filtered, groups=filtered[group]), 1):  # start at fold 1, not 0
        # Create training and validation data sets and image generators
        train, val = filtered.iloc[train_index], filtered.iloc[val_index]
        (train_img, val_img, preproc) = images_from_df(train_df=train, image_column='id', label_columns='DIFF',
                                                       directory=f"images/{expt}", suffix='.tif',
                                                       val_df=val, is_regression=True, target_size=dim, color_mode='rgb')
        # Create model
        model = vis.image_regression_model(name='pretrained_resnet50', train_data=train_img, val_data=val_img,
                                           freeze_layers=None, metrics=['mse'])
        learner = ktrain.get_learner(model=model, train_data=train_img, val_data=val_img,
                                     workers=multiprocessing.cpu_count()-1, use_multiprocessing=False, batch_size=64)
        # Train model
        print(f'Run {r+1} Fold {i}')
        learner.fit_onecycle(1e-4, epochs)
        # Plot training and validation loss
        learner.plot()
        # Create Predictor instance
        predictor = ktrain.get_predictor(learner.model, preproc)

        def predict_diff(row):
            id = row['id']
            fname = f'images/{expt}/{id}.tif'
            pred = round(predictor.predict_filename(fname, return_proba=True)[0])
            return pred

        # Predict on the validation images and store results
        mask = results.id.isin(val.id)
        results.loc[mask, f'Run_{r+1}'] = results[mask].apply(lambda row: predict_diff(row), axis=1)
        # Free GPU memory before the next fold
        release_gpu_memory()

results.to_csv("21PLTR-NNN_PED.csv", index=False)
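To see whether GPU memory is actually being released between folds, I am considering printing the allocator stats right before the cleanup call, something like this sketch (not in the script above; tf.config.experimental.get_memory_info requires a reasonably recent TensorFlow 2.x, if I remember correctly):

# Hypothetical per-fold check: report how much GPU memory TensorFlow has
# allocated right before attempting to release it.
info = tf.config.experimental.get_memory_info('GPU:0')
print(f"GPU memory - current: {info['current'] / 1e6:.0f} MB, peak: {info['peak'] / 1e6:.0f} MB")
release_gpu_memory()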
I tried using the built-in release_gpu_memory function provided by Ktrain, which includes the Keras clear_session call. I expected the epochs in each successive fold to take about as long as those in the first fold, but instead the training time increases substantially.
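For reference, this is the bare-bones cleanup I would have expected release_gpu_memory to be roughly equivalent to (a sketch of my understanding, not Ktrain code): drop the references to the fold's objects, clear the Keras session, and force garbage collection.

import gc
import tensorflow as tf

def basic_cleanup():
    """Reset Keras/TensorFlow graph state and force Python garbage collection."""
    tf.keras.backend.clear_session()
    gc.collect()

# at the end of each fold, something like:
# del model, learner, predictor
# basic_cleanup()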