
I have a script that performs 5-fold cross-validation on an image set with the pretrained ResNet50 model using ktrain, which is a lightweight wrapper around TensorFlow Keras. A model for each fold is trained for 30 epochs, and the entire CV procedure is repeated 5 additional times.

The first fold of the first run trains fast enough for my purposes, about 6 minutes per epoch:

Epoch 1/10
288/288 [==============================] - 370s 1s/step - loss: 13.0123 - mse: 13.0123 - val_loss: 2.8116 - val_mse: 2.8116
Epoch 2/10
288/288 [==============================] - 367s 1s/step - loss: 6.2146 - mse: 6.2146 - val_loss: 2.2179 - val_mse: 2.2179

However, the per-epoch training time in successive folds is substantially higher, roughly 37 minutes per epoch (about 2,200 s):

begin training using onecycle policy with max lr of 0.0001...
Epoch 1/10
286/286 [==============================] - 2229s 8s/step - loss: 8.5098 - mse: 8.5098 - val_loss: 2.3128 - val_mse: 2.3128
Epoch 2/10
286/286 [==============================] - 2213s 8s/step - loss: 5.2229 - mse: 5.2229 - val_loss: 2.4311 - val_mse: 2.4311

At the end of each fold I use the Ktrain function release_gpu_memory, defined as:

def release_gpu_memory(device=0):
    """
    ```
    Release GPU memory allocated by Tensorflow
    Source:
    https://stackoverflow.com/questions/51005147/keras-release-memory-after-finish-training-process
    ```
    """
    from numba import cuda

    K.clear_session()  # K is tensorflow.keras.backend, imported at the ktrain module level
    cuda.select_device(device)
    cuda.close()
    return
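One thing I am not sure about is whether TensorFlow can still use the GPU after numba's cuda.close() tears down the CUDA context in the same process; if the later folds were silently falling back to the CPU, that would match the slowdown. A minimal per-fold check I could add (assuming TensorFlow 2.x) would be:

import tensorflow as tf

# If this prints an empty list on folds after the first, TensorFlow has
# lost access to the GPU and is training on the CPU.
print(tf.config.list_physical_devices('GPU'))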

A common solution I have seen is to call the Keras clear_session() function, which release_gpu_memory already does, yet it does not appear to help. What can I do to keep the training time consistent across iterations?
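For completeness, the fuller per-fold cleanup that is usually recommended alongside clear_session() looks roughly like this (a minimal sketch; the explicit del and gc.collect() are additions that release_gpu_memory does not perform):

import gc
from tensorflow.keras import backend as K

# Drop all Python references to this fold's objects so the model and
# its weights become garbage-collectable, then reset the Keras state.
del predictor, learner, model
gc.collect()
K.clear_session()

Below is my script: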

import os
import ktrain
from ktrain import vision as vis
from ktrain.vision.data import images_from_df
from ktrain.core import release_gpu_memory
import pandas as pd
from pprint import pprint
import glob2
from sklearn.model_selection import (GroupShuffleSplit, StratifiedGroupKFold)
from sklearn.model_selection import GroupKFold
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import numpy as np
import tensorflow as tf
import multiprocessing

#############################
# Initial settings excluded #
#############################

# Initialize results df
results = labels.copy()

for r in range(runs):

    # Generate splits
    gkf = GroupKFold(n_splits=5)
    
    for i, (train_index, val_index) in enumerate(gkf.split(filtered, groups=filtered[group]), 1): # start at fold 1, not 0
        
        # Create training and validation data sets and image generators
        train, val = filtered.iloc[train_index], filtered.iloc[val_index]
        (train_img, val_img, preproc) = images_from_df(train_df=train, image_column='id', label_columns='DIFF', directory=f"images/{expt}", suffix='.tif',
                                                       val_df=val, is_regression=True, target_size=dim, color_mode='rgb')
        
        # Create model
        model = vis.image_regression_model(name='pretrained_resnet50', train_data=train_img, val_data=val_img, 
                                           freeze_layers=None, metrics=['mse'])

        learner = ktrain.get_learner(model=model, train_data=train_img, val_data=val_img, 
                                     workers=multiprocessing.cpu_count()-1, use_multiprocessing=False, batch_size=64)
        
        # Train model
        print(f'Run {r+1} Fold {i}')
        
        learner.fit_onecycle(1e-4, epochs)
        
        # Plot training and validation loss
        learner.plot()
        
        # Create Predictor instance
        predictor = ktrain.get_predictor(learner.model, preproc)

        def predict_diff(row):
            # Predict the regression target for a single image file
            img_id = row['id']
            fname = f'images/{expt}/{img_id}.tif'
            pred = round(predictor.predict_filename(fname, return_proba=True)[0])
            return pred

        mask = results.id.isin(val.id)

        results.loc[mask, f'Run_{r+1}'] = results.loc[mask].apply(predict_diff, axis=1)
        
        release_gpu_memory()


results.to_csv("21PLTR-NNN_PED.csv", index=False)

I tried using the built-in release_gpu_memory function provided by ktrain, which includes the Keras clear_session call. I expected the epochs in successive folds to take about as long as those in the first fold. However, the training time increased substantially.
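One workaround I am considering, but have not yet tried, is to run each fold in its own child process so that all GPU memory is guaranteed to be released when the process exits. A sketch, assuming the fold body above is factored into a hypothetical run_fold function that returns the per-row predictions:

import multiprocessing as mp

def fold_worker(r, i, train_index, val_index, queue):
    # run_fold is a hypothetical refactor of the loop body above: it
    # builds the data, trains one model, and returns the predictions.
    queue.put(run_fold(r, i, train_index, val_index))

ctx = mp.get_context('spawn')  # a fresh process gets a fresh CUDA context
queue = ctx.Queue()
p = ctx.Process(target=fold_worker, args=(r, i, train_index, val_index, queue))
p.start()
preds = queue.get()  # fetch before join() to avoid blocking on a full queue
p.join()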
