Iterating GridSearchCV over multiple datasets gives identical result for each

Question

I am trying to run a grid search in scikit-learn for one algorithm with different hyperparameters over multiple training datasets stored in a dictionary. First, I define the hyperparameters and the model to be used:

scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
grid_search = {}

for key in X_train_d.keys():
    cv = StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True; recent scikit-learn rejects it
    model = XGBClassifier(objective="binary:logistic", random_state=42)
    space = dict()
    space['n_estimators']=[50] # 200
    space['learning_rate']= [0.5] #0.01, 0.3, 0.5
    grid_search= GridSearchCV(model, space, scoring=scoring, cv=cv, n_jobs=3, verbose=2, refit='balanced_accuracy')

Then, I create an empty dictionary that should be populated with as many GridSearchCV objects as X_train_d.keys(), via:

grid_result = {}    
for key in X_train_d.keys():
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])

Finally, I build one results data frame per key, holding the scoring information, via:

df_grid_results = {}
for key in X_train_d.keys():
    df_grid_results[key]=pd.DataFrame(grid_search.cv_results_)
    df_grid_results[key] = (
    df_grid_results[key]
    .set_index(df_grid_results[key]["params"].apply(
        lambda x: "_".join(str(val) for val in x.values()))
    )
    .rename_axis('kernel')
    )

Everything works "perfectly", in the sense that no error is shown. However, when I inspect either the GridSearchCV objects or the df_grid_results data frames, the results are all identical, as if the model had been fit on the same dataset over and over again, even though the X_train_d and Y_train_d dictionaries contain different datasets.

Of course, when I fit a model individually, like:

model1_cv = grid_search.fit(X_train_d[1], Y_train_d[1])
model2_cv = grid_search.fit(X_train_d[2], Y_train_d[2])

results differ as expected.

I feel like I am missing something really stupid and obvious here. Can anybody help? Thanks!

Welcome to Stack Overflow. Please post a working piece of code so we can try to help; here X_train_d is not defined. It seems that you overwrite the grid_search variable on each iteration, so only the last one is kept. This could explain your result. You have to define and use grid_search within the same loop iteration before proceeding to the next one. — Malo, Dec 19 '21 at 17:11

score 0 · Answer 1 · answered Dec 19 '21 at 17:10

It seems that you overwrite the grid_search variable on each iteration, so only the last one is kept. This could explain your result. You have to define and use grid_search within the same loop iteration before proceeding to the next one. Please provide working code and data, and I will edit your code.
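To make the overwriting concrete, here is a minimal sketch of what happens when one GridSearchCV object is fitted repeatedly and the return value of fit is stored under each key. It relies on the fact that fit returns the estimator itself rather than a copy. The synthetic data, the LogisticRegression model and the C grid are stand-ins chosen only so the snippet runs without XGBoost; they are assumptions, not the asker's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the asker's X_train_d / Y_train_d dictionaries
X_train_d = {1: rng.normal(size=(60, 3)), 2: rng.normal(size=(60, 3))}
Y_train_d = {1: np.tile([0, 1], 30), 2: np.tile([0, 1], 30)}

# One shared search object, reused for every dataset
grid_search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=3)

grid_result = {}
for key in X_train_d:
    # fit() returns the estimator itself, so every key stores the same object
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])

same_object = grid_result[1] is grid_result[2]

# Creating a fresh search object per key keeps the fits separate
grid_result_fixed = {}
for key in X_train_d:
    fresh_search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=3)
    grid_result_fixed[key] = fresh_search.fit(X_train_d[key], Y_train_d[key])

distinct_objects = grid_result_fixed[1] is not grid_result_fixed[2]

print(same_object, distinct_objects)
```

In the first loop every entry of grid_result refers to the single shared object, which only remembers its most recent fit. In the second loop each key owns its own fitted search.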

The idea is like this:

scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
grid_search = {}
grid_result = {}    

for key in X_train_d.keys():
    cv = StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True; recent scikit-learn rejects it
    model = XGBClassifier(objective="binary:logistic", random_state=42)
    space = dict()
    space['n_estimators']=[50] # 200
    space['learning_rate']= [0.5] #0.01, 0.3, 0.5
    grid_search= GridSearchCV(model, space, scoring=scoring, cv=cv, n_jobs=3, verbose=2, refit='balanced_accuracy')
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])
    
df_grid_results = {}
for key in X_train_d.keys():
    df_grid_results[key]=pd.DataFrame(grid_search.cv_results_)
    df_grid_results[key] = (
    df_grid_results[key]
    .set_index(df_grid_results[key]["params"].apply(
        lambda x: "_".join(str(val) for val in x.values()))
    )
    .rename_axis('kernel')
    )

Thanks, Malo. Unfortunately, your solution leads to the same problem: all results are identical. Also, I cannot share any data, for confidentiality reasons on this project. — Giant Steps, Dec 20 '21 at 10:09
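A likely reason the results still look identical is that the second loop reads grid_search.cv_results_, and after the first loop has finished, grid_search refers only to the object fitted on the last key. The sketch below reads from grid_result[key] instead. The datasets, the DecisionTreeClassifier and the max_depth grid are hypothetical stand-ins so the snippet is self-contained; the two datasets are given different feature counts so it is easy to check which data each search saw.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical datasets with different numbers of features (3 vs 4)
X_train_d = {"a": rng.normal(size=(80, 3)), "b": rng.normal(size=(80, 4))}
Y_train_d = {"a": np.tile([0, 1], 40), "b": np.tile([0, 1], 40)}

grid_result = {}
for key in X_train_d:
    # a new search object per dataset, stored under its key
    grid_search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [1, 2]},
        cv=StratifiedKFold(n_splits=4),
    )
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])

df_grid_results = {}
for key in X_train_d:
    # read from grid_result[key]; plain grid_search is the last fitted object
    df = pd.DataFrame(grid_result[key].cv_results_)
    kernel = df["params"].apply(lambda p: "_".join(str(v) for v in p.values()))
    df_grid_results[key] = df.set_index(kernel).rename_axis("kernel")

# Each refitted best estimator remembers how many features it was trained on
n_features = {k: grid_result[k].best_estimator_.n_features_in_ for k in grid_result}
print(n_features)
```

The feature counts recorded by the refitted estimators match their own datasets, which shows that each entry of grid_result belongs to a separate fit.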

score 0 · Accepted Answer · answered Dec 19 '21 at 17:42

As Malo pointed out, the problem is in the last loop: it reads grid_search.cv_results_, which belongs to the most recently fitted object, so the grid search results for the last dataset are copied into every data frame. The multiple loops in your code are not really needed, though. You can simplify the code to run a single loop and save the results directly in one data frame, as follows:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# features datasets
X_train_d = {
    'd1': np.random.normal(0, 1, (100, 3)), 
    'd2': np.random.normal(0, 1, (100, 5))
}

# labels datasets
Y_train_d = {
    'd1': np.random.choice([0, 1], 100), 
    'd2': np.random.choice([0, 1], 100)
}

# parameter grid
param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.3, 0.5]}

# evaluation metrics
scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']

# cross-validation splits
cv = StratifiedKFold(n_splits=5)

# per-dataset results, collected in a list and concatenated once at the end
frames = []

for key in X_train_d.keys():

    # run the grid search
    grid_search = GridSearchCV(
        estimator=XGBClassifier(objective='binary:logistic', random_state=42), 
        param_grid=param_grid, 
        scoring=scoring, 
        cv=cv, 
        n_jobs=3, 
        verbose=2, 
        refit='balanced_accuracy'
    )
    
    grid_search.fit(X_train_d[key], Y_train_d[key])
    
    # save the grid search results, tagged with the dataset key
    df_temp = pd.DataFrame(grid_search.cv_results_)
    df_temp['dataset'] = key
    frames.append(df_temp)

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
df_grid_results = pd.concat(frames, ignore_index=True)

# index each row by its parameter values, e.g. '0.3_50'
kernel = df_grid_results['params'].apply(lambda x: '_'.join(str(val) for val in x.values()))
df_grid_results = df_grid_results.set_index(kernel).rename_axis('kernel')

print(df_grid_results[['dataset', 'mean_test_accuracy', 'mean_test_balanced_accuracy', 'mean_test_f1', 'mean_test_precision', 'mean_test_recall']])
#         dataset  mean_test_accuracy  mean_test_balanced_accuracy  mean_test_f1  mean_test_precision  mean_test_recall  
# kernel                                                             
# 0.3_50       d1                0.40                     0.403232      0.349067             0.399953          0.335556  
# 0.3_100      d1                0.38                     0.382323      0.356022             0.368983          0.355556  
# 0.5_50       d1                0.43                     0.429596      0.351857             0.391209          0.335556  
# 0.5_100      d1                0.41                     0.409596      0.342767             0.365812          0.335556  
# 0.3_50       d2                0.55                     0.540025      0.448419             0.501948          0.436111
# 0.3_100      d2                0.57                     0.556692      0.462381             0.515996          0.436111  
# 0.5_50       d2                0.62                     0.607449      0.536695             0.587857          0.502778  
# 0.5_100      d2                0.64                     0.629672      0.571682             0.607857          0.547222
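Because the 'kernel' index repeats across datasets in the combined frame, picking the best configuration per dataset needs a group-by on the dataset column. Below is a small sketch of one way to do that. The frame is rebuilt by hand from the balanced-accuracy column of the printed output above, so it runs without refitting anything; the variable names are illustrative.

```python
import pandas as pd

# Values copied from the mean_test_balanced_accuracy column printed above
df_grid_results = pd.DataFrame(
    {
        "dataset": ["d1"] * 4 + ["d2"] * 4,
        "mean_test_balanced_accuracy": [
            0.403232, 0.382323, 0.429596, 0.409596,
            0.540025, 0.556692, 0.607449, 0.629672,
        ],
    },
    index=pd.Index(["0.3_50", "0.3_100", "0.5_50", "0.5_100"] * 2, name="kernel"),
)

# The index labels are not unique, so move 'kernel' into a column first
flat = df_grid_results.reset_index()

# Row position of the highest balanced accuracy within each dataset
best_idx = flat.groupby("dataset")["mean_test_balanced_accuracy"].idxmax()
best_rows = flat.loc[best_idx]

best_kernel = dict(zip(best_rows["dataset"], best_rows["kernel"]))
print(best_kernel)  # {'d1': '0.5_50', 'd2': '0.5_100'}
```

Since the example datasets are random noise, the scores themselves carry no meaning; the point is only the selection pattern.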
