When trying to train/evaluate a support vector machine in scikit-learn, I am experiencing some unexpected behaviour, and I am wondering whether I am doing something wrong or whether this is a bug.
In a very specific set of circumstances, nested cross-validation using GridSearchCV and an SVM produces inflated predictive results, even on randomly generated data.
For instance, see this code:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, LeaveOneOut
from sklearn.metrics import roc_auc_score, brier_score_loss
from tqdm import tqdm
import pandas as pd
N = 20
N_FEATURES = 50
param_grid = {'C': [1e-5, 1e-3, 1, 1e3, 1e5]}
scores = []
for z in tqdm(range(100)):
    # random features and random labels
    X = np.random.uniform(size=(N, N_FEATURES))
    y = np.random.binomial(1, 0.5, size=N)

    # in the first 10 repetitions, force the labels to be exactly balanced
    if z < 10:
        y = np.array([0, 1] * int(N / 2))
        y = np.random.permutation(y)

    for skf_outer in [StratifiedKFold(n_splits=5), LeaveOneOut()]:
        for skf_inner in [5, LeaveOneOut()]:
            for model in [svm.SVC(probability=True), LogisticRegression()]:
                y_pred, y_real = [], []

                # outer CV loop
                for train_index, test_index in skf_outer.split(X, y):
                    X_train, X_test = X[train_index], X[test_index, :]
                    y_train, y_test = y[train_index], y[test_index]

                    # inner CV: hyperparameter search on the training fold only
                    clf = GridSearchCV(
                        model, param_grid, cv=skf_inner, n_jobs=-1, scoring='neg_brier_score'
                    )
                    clf.fit(X_train, y_train)

                    predictions = clf.predict_proba(X_test)[:, 1]
                    y_pred.extend(predictions)
                    y_real.extend(y_test)

                scores.append([
                    str(skf_outer), str(skf_inner), str(model), np.mean(y),
                    brier_score_loss(np.array(y_real), np.array(y_pred)),
                    roc_auc_score(np.array(y_real), np.array(y_pred)),
                ])
df_scores = pd.DataFrame(scores)
df_scores.columns = ['skf_outer', 'skf_inner', 'model', 'y_label', 'brier', 'auc']
df_scores['y_0.5'] = df_scores['y_label'] == 0.5
df_scores = df_scores.groupby(['skf_outer', 'skf_inner', 'model', 'y_0.5']).mean()
print(df_scores)
In the following circumstances:
- LeaveOneOut() is used in both the inner and the outer loop of the CV
- The SVM is used
- The y labels are balanced (i.e. the mean of y is 0.5)
The predictions are much better than expected by random chance (AUC > 0.9, sometimes even 1; Brier score of 0.15 or lower). I can replicate this with more samples, more features, etc.; the issue stays the same. Swapping the SVM for LogisticRegression (as shown in the analysis above) gives the expected results (AUC of 0.5, Brier score of 0.25). In the other scenarios (no LOO CV in either the inner or the outer loop, or a different distribution of y labels), the results are also as expected.
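In case it helps with reproducing, here is a stripped-down version of the code above that runs only this problematic combination (LOO outer, LOO inner, SVC(probability=True), balanced labels). It is just the relevant subset of the full script; the only addition is a fixed random seed for reproducibility:

from sklearn import svm
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.metrics import roc_auc_score
import numpy as np

np.random.seed(0)  # added only so the repro is deterministic
N, N_FEATURES = 20, 50
X = np.random.uniform(size=(N, N_FEATURES))
y = np.random.permutation([0, 1] * (N // 2))  # random but exactly balanced labels

param_grid = {'C': [1e-5, 1e-3, 1, 1e3, 1e5]}

y_pred, y_real = [], []
for train_index, test_index in LeaveOneOut().split(X):
    clf = GridSearchCV(
        svm.SVC(probability=True), param_grid,
        cv=LeaveOneOut(), scoring='neg_brier_score', n_jobs=-1
    )
    clf.fit(X[train_index], y[train_index])
    y_pred.append(clf.predict_proba(X[test_index])[0, 1])
    y_real.append(y[test_index][0])

# Labels are random, so I'd expect AUC ~0.5; instead this comes out much higher (often > 0.9)
print(roc_auc_score(y_real, y_pred))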
Can anyone replicate this? Am I missing something obvious?
I've replicated this with an older version of sklearn (0.24.0) and the newest one (1.2.0).