How to compare baseline and GridSearchCV results fairly?

Question

I am a bit confused about how to compare the best GridSearchCV model with a baseline.
For example, suppose we have a classification problem.
As a baseline, we fit a model with default settings (say, logistic regression):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))

So, the baseline gives us the accuracy on the whole training sample.
Next, GridSearchCV:

from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
# Hold out 30% of the training data; the search runs on the remaining 70%
X_val, X_test_val, y_val, y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
# The estimator comes first, then the grid; scoring and cv belong to GridSearchCV
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_val, y_val)
print(best_model.best_score_)

Here, the accuracy is based on the validation sample.

My questions are:

Are those accuracy scores comparable? More generally, is it fair to compare a GridSearchCV result with a model evaluated without any cross-validation?
For the baseline, wouldn't it be better to use the validation sample too (instead of the whole training sample)?

score 4 · Accepted Answer · answered Nov 04 '21 at 21:17

No, they aren't comparable.

Your baseline model was fitted on X_train and then scored on that same X_train sample. This is like cheating: the model is evaluated on data it has already seen, so the score is optimistically biased and says little about how it will perform on new data.
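To make the optimism of a training-set score concrete, here is a minimal sketch. It assumes synthetic data from make_classification with deliberately few samples and many features (my choice, not the asker's dataset), so the gap is easy to see. The exact numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): few samples, many features
X, y = make_classification(n_samples=120, n_features=100, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Accuracy on the data the model was fitted on (optimistic)
train_acc = accuracy_score(y_train, baseline.predict(X_train))
# Accuracy on data the model has never seen (what we actually care about)
test_acc = accuracy_score(y_test, baseline.predict(X_test))

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

Only the second number estimates how the model behaves on unseen data.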

The grid searched model is at a disadvantage because:

It works with less data, because you split the X_train sample before running the search.
On top of that, each fit inside the search sees even less data: with 5 folds, every model is trained on only 4/5 of X_val.
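The shrinking sample sizes can be checked directly. The sketch below assumes a placeholder training set of 1000 rows (a made-up size, only there to make the arithmetic visible) and reuses the split and fold settings from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Placeholder training set of 1000 rows with balanced labels (assumption)
X_train = np.zeros((1000, 3))
y_train = np.array([0, 1] * 500)

# Same split as in the question: 70% goes to the grid search
X_val, X_test_val, y_val, y_test_val = train_test_split(
    X_train, y_train, test_size=0.3, random_state=42)

# Same CV scheme as in the question: each fit uses 4 of the 5 folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_train_sizes = [len(train_idx) for train_idx, _ in cv.split(X_val, y_val)]

print("rows used by the baseline:   ", len(X_train))
print("rows given to the search:    ", len(X_val))
print("rows per fit inside the CV:  ", fold_train_sizes)
```

So each model scored inside the search is trained on roughly 0.7 × 0.8 = 56% of the rows the baseline was fitted on.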

So the grid search score will most likely come out lower than the baseline's training accuracy, even if the tuned model is actually the better one.

Now you might ask, "So what's the point of best_model.best_score_?" Well, that score is the mean cross-validated score of the best parameter combination. It is used to compare the candidates tried while searching for the optimal hyperparameters in your search space, and it should not be compared against a score from a model that was evaluated outside of the grid search context.
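As a small illustration of what best_score_ holds, the sketch below runs a search on synthetic data and checks it against cv_results_. The data and the grid over C are assumptions made up for the example; the original parameters list is not shown in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data and a hypothetical grid (assumptions, not the asker's setup)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="accuracy", cv=cv)
search.fit(X, y)

# One mean CV accuracy per candidate in the grid
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, round(float(score), 3))

# best_score_ is the mean CV accuracy of the winning candidate
print("best params:", search.best_params_)
print("best_score_:", search.best_score_)
```

In other words, best_score_ ranks candidates against each other on the same folds; it is not a held-out test score.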

So how should one go about conducting a fair comparison?

Split your training data for both models.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Fit your models using X_train.

# fit baseline
baseline.fit(X_train, y_train)

# fit using grid search
best_model.fit(X_train, y_train)

Evaluate models against X_test.

# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))

# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))
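Putting the three steps together, a self-contained sketch might look like the following. The synthetic dataset and the grid over C are placeholders I made up so the example runs on its own; substitute your own X, y and parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the real data (assumption)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# 1. One split, shared by both models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 2. Fit both models on the same training data
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
best_model = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          scoring="accuracy", cv=cv)
# With the default refit=True, the best candidate is refitted on all of X_train
best_model.fit(X_train, y_train)

# 3. Score both models on the same untouched test data
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
grid_acc = accuracy_score(y_test, best_model.predict(X_test))

print(f"baseline test accuracy:    {baseline_acc:.3f}")
print(f"grid search test accuracy: {grid_acc:.3f}")
```

Because GridSearchCV refits the best candidate on the full X_train by default, both models end up trained on the same rows and scored on the same rows, which is what makes the two numbers comparable.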
