
I'm currently working on a 2D CNN in Keras for MRI classification. The class ratio is about 60/40. I have 155 patients, each with one MRI consisting of around 180 slices, and the input to the CNN is a single MRI slice (256*256 px), so the total input is ~27,900 images of 256*256 pixels each.

I tested different models and always evaluated them with shuffled stratified 10-fold cross validation and an EarlyStopping monitor, and they all performed very well, around 95% to 98% validation accuracy. But every time, one or two folds perform a lot worse than the other ones (70% to 80% validation accuracy). Since the folds are randomized, I would expect them all to perform equally well.

Can somebody explain how this could happen and how to prevent it?

Plots for accuracy and loss:

Train accuracy and validation accuracy

Train loss and validation loss

This is part of one of the models:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

num_classes = 2
img_size = 256
batch_size = 200

# Because of the EarlyStopping monitor, the number of epochs doesn't really matter
num_epochs = 1000

kfold_splits = 10
skf = StratifiedKFold(n_splits=kfold_splits, shuffle=True)

# Here the data is split into stratified, shuffled folds
for index, (train_index, test_index) in enumerate(skf.split(x_data_paths, y_data_paths)):

    x_train, x_test = np.array(x_data_paths)[train_index], np.array(x_data_paths)[test_index]
    y_train, y_test = np.array(y_data_paths)[train_index], np.array(y_data_paths)[test_index]

    # One-hot encode the labels for the softmax output (assumed; not shown in the original snippet)
    y_train_one_hot = to_categorical(y_train, num_classes)
    y_test_one_hot = to_categorical(y_test, num_classes)

    training_batch_generator = BcMRISequence(x_train, y_train_one_hot, batch_size)
    test_batch_generator = BcMRISequence(x_test, y_test_one_hot, batch_size)

    # region Create model (using the functional API)
    inputs = Input(shape=(img_size, img_size, 1))
    conv1 = Conv2D(64, kernel_size=5, strides=1, activation='relu')(inputs)
    pool1 = MaxPooling2D(pool_size=3, strides=(2, 2), padding='valid')(conv1)
    conv2 = Conv2D(32, kernel_size=3, activation='relu')(pool1)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)
    conv3 = Conv2D(16, kernel_size=3, activation='relu')(pool2)
    pool3 = MaxPooling2D(pool_size=(2, 2))(conv3)
    flat = Flatten()(pool3)
    hidden1 = Dense(10, activation='relu')(flat)
    output = Dense(num_classes, activation='softmax')(hidden1)
    model = Model(inputs=inputs, outputs=output)
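
The compile/fit step isn't included in the snippet above. For context, here is a minimal sketch of how each fold might be trained with the EarlyStopping monitor mentioned earlier; the optimizer, loss, callback settings, and use of `fit_generator` are assumptions, not the exact code used:

    # Sketch only (assumed, not the original code): compile and train this fold,
    # letting EarlyStopping cut training short well before num_epochs.
    from keras.callbacks import EarlyStopping  # normally placed with the other imports

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Stop once the validation loss stops improving; num_epochs is only an upper bound
    early_stopping = EarlyStopping(monitor='val_loss', patience=10,
                                   restore_best_weights=True)

    model.fit_generator(training_batch_generator,
                        epochs=num_epochs,
                        validation_data=test_batch_generator,
                        callbacks=[early_stopping])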

Sinraw
  • That's a lot of images, so there shouldn't be any reason to use k-fold; just use a large model like `Resnet50` and don't use cross-validation. – Natthaphon Hongcharoen Sep 10 '19 at 13:52
  • That's true, but I'd still like to know how this could happen, since it could indicate an underlying problem, don't you think? – Sinraw Sep 10 '19 at 13:54
  • There are many possible reasons for a model to do badly on some of the data, but looking at your code, it's most likely that your model is too small and has neither `Batch Normalization` nor `Dropout`, so it's really easy to overfit (a sketch of this is below). – Natthaphon Hongcharoen Sep 10 '19 at 13:56
  • Another suspect I can think of is that `batch_size` is too large; in my experience this sometimes causes overfitting too. – Natthaphon Hongcharoen Sep 10 '19 at 13:58
  • But the training loss and validation loss graphs indicate that it isn't overfitting, since they are both still decreasing, don't they? – Sinraw Sep 10 '19 at 13:58
  • I just looked at your images, but the yellow and green lines only start to converge around epoch 20 instead of right after training starts. Let's see if a proper model makes any difference; MobileNet would do it in my opinion. Also, loading ImageNet pre-trained weights helps training. – Natthaphon Hongcharoen Sep 10 '19 at 14:07
  • If that works, then the reason may be that the randomized initial weights are bad; this can be solved by using already fine-tuned weights (even if the tasks are completely different). – Natthaphon Hongcharoen Sep 10 '19 at 14:11
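
As a rough illustration of the regularization suggested in the comments, here is a sketch of the same architecture with `BatchNormalization` after each convolution and `Dropout` before the classifier. The layer placement and the dropout rate are assumptions, not a tested configuration:

from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten, Dense,
                          BatchNormalization, Dropout)

# Sketch only: same model as in the question, with added regularization layers
inputs = Input(shape=(img_size, img_size, 1))
x = Conv2D(64, kernel_size=5, strides=1, activation='relu')(inputs)
x = BatchNormalization()(x)
x = MaxPooling2D(pool_size=3, strides=(2, 2), padding='valid')(x)
x = Conv2D(32, kernel_size=3, activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(16, kernel_size=3, activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Flatten()(x)
x = Dense(10, activation='relu')(x)
x = Dropout(0.5)(x)  # assumed rate, tune as needed
output = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=inputs, outputs=output)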

1 Answer


Just in case anyone else stumbles upon my question: by using stratified shuffle split with 10 iterations instead of 10-fold cross validation, I got rid of the outlier fold. My guess is that the bad fold appeared because of some kind of "batch effect" (the 10-fold cross validation doesn't shuffle the MRI slices, only the patients). The shuffle split, on the other hand, shuffles the entire data set before splitting into train and test, and therefore prevents "bad" patients from ending up together in one fold.
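
For reference, the change described above is roughly the following (a minimal sketch assuming the same `x_data_paths`/`y_data_paths` variables as in the question; the `test_size` value is an assumption chosen to keep each test portion comparable in size to one fold):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Sketch only: 10 shuffled, stratified train/test splits instead of 10-fold CV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1)

for index, (train_index, test_index) in enumerate(sss.split(x_data_paths, y_data_paths)):
    x_train, x_test = np.array(x_data_paths)[train_index], np.array(x_data_paths)[test_index]
    y_train, y_test = np.array(y_data_paths)[train_index], np.array(y_data_paths)[test_index]
    # ... same one-hot encoding, generators, model and training as before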

These are the new plots, in case anyone is interested. Same model as before, just shuffle split instead of k-fold cross validation.

Train accuracy and validation accuracy

Train loss and validation loss

Sinraw