
Problem: Scikit-learn's GridSearchCV is returning the parameter which results in the worst score (Root MSE) rather than the best.

I think one possible cause is that I am not using train_test_split to create a hold-out test set, because this is time series data and I do not want to disrupt the time order. Another possible cause is that I have over 7,000 features but only 50 observations. Clarification from anyone who knows whether these could be the problems, and what I might do to remedy them, would be greatly appreciated.
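For reference, the kind of chronological hold-out I mean would look something like this (a sketch, assuming the rows of news_df are already in time order):

from sklearn.model_selection import train_test_split

# shuffle=False keeps rows in their original (time) order, so the test
# set is simply the most recent 20% of observations
X_train, X_test, y_train, y_test = train_test_split(
    news_df, y_battles, test_size=0.2, shuffle=False)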

I start with the following code:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline

# news_df (features) and y_battles (target) are loaded elsewhere
ridge_pipe = make_pipeline(Ridge(random_state=42, max_iter=100000))

tscv = TimeSeriesSplit(n_splits=5)

# note: np.logspace treats its arguments as exponents, so this grid
# actually spans 10**1e-300 (~1.0) to 10**0.1 (~1.26), not 1e-300 to 1e-1
param_grid = {'ridge__alpha': np.logspace(1e-300, 1e-1, 500)}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))

It gives me this output:

{'ridge__alpha': 1.2589254117941673}
-4.067235334106922

Skeptical that this would be the best Root MSE, I next tried finding the score when considering an alpha value of 1e-300 alone:

param_grid = {'ridge__alpha': [1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))

It gives me this output:

{'ridge__alpha': 1e-300}
-2.0906161667718835e-13

Clearly, then, an alpha of 1e-300 yields a better Root MSE (approx. -2e-13) than the alpha of approx. 1.26 chosen above (approx. -4). As I understand it, the negative Root MSE that GridSearchCV reports is just the ordinary Root MSE with its sign flipped, so a score of -2e-13 is really an RMSE of 2e-13 and -4 is really 4. And the lower the RMSE, the better.
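As a quick sanity check of that sign convention (with made-up numbers, not my actual data):

from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 4.0]  # made-up values just to illustrate
y_pred = [2.5, 5.0, 4.5]
rmse = mean_squared_error(y_true, y_pred, squared=False)  # ordinary RMSE
print(-rmse)  # the 'neg_root_mean_squared_error' scorer reports this negated value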

To see if np.logspace could be the culprit, I instead provide just a list of values:

param_grid = {'ridge__alpha': [1e-1, 1e-50, 1e-60, 1e-70, 1e-80, 1e-90, 1e-100, 1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))

And the output shows the same problem:

{'ridge__alpha': 0.1}
-2.0419740158869386

And I don't think TimeSeriesSplit is the cause either: passing cv=5 instead of cv=tscv to GridSearchCV() produces the same problem.

The same issue happens when I try Lasso instead of Ridge. Any thoughts?

Rob
  • Welcome to Stack Overflow. In your own words, where the code says `scoring='neg_root_mean_squared_error'`, what do you expect that to mean? Did you try using other values for `scoring`? (Did you try checking the documentation to see what other values are possible?) "since negative Root MSE using GridSearchCV means the same thing - as I understand it - as positive Root MSE in all other contexts." I don't know about any of these libraries, but that seems like a strange expectation to me. Can you cite the part of the documentation that leads you to this conclusion? – Karl Knechtel Jun 27 '22 at 01:37
  • Karl, the negative Root MSE convention is my understanding from my master's degree program, and it is also confirmed by this post: https://stackoverflow.com/questions/48244219/is-sklearn-metrics-mean-squared-error-the-larger-the-better-negated – Rob Jun 27 '22 at 01:51

1 Answer


This appears to be fine. The problem is that you're comparing the final outputs on the same dataset that the best_estimator_ was trained on (the search's score method applies your scoring function to search.best_estimator_, i.e. the model refitted on the entire training set with the best hyperparameters); but the grid search selects based on cross-validated scores, which are a better indicator of future performance.
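In other words, these two calls score the same refitted model on the same data (a quick check, reusing the fitted grid and data from the question):

from sklearn.metrics import mean_squared_error

# grid.score negates the RMSE of best_estimator_'s predictions
print(grid.score(news_df, y_battles))
print(-mean_squared_error(y_battles, grid.best_estimator_.predict(news_df), squared=False))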

Specifically, with alpha=1e-300 (practically zero), the model overfits badly to the training data, so the RMSE on that same training data is tiny (2e-13). Meanwhile, with alpha=1.26, the model fits the training data less closely (RMSE 4) but performs better on unseen data. You can see those cross-validated scores in the grid search's attribute cv_results_.
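For example, a minimal sketch of pulling out the per-candidate cross-validated scores (reusing the fitted grid from the question):

import pandas as pd

cv_df = pd.DataFrame(grid.cv_results_)
# mean_test_score is the mean CV score; it is negated RMSE here, so closer to 0 is better
print(cv_df[['param_ridge__alpha', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))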

Ben Reiniger
  • Is there a way to print the mean cross-validation score from grid.cv_results_['mean_test_score'] for the optimal parameter(s) GridSearchCV found, rather than the score when the optimal parameters are applied to a model fit on the entire (training) dataset? – Rob Jun 27 '22 at 14:04
  • Never mind - found it: It's grid.best_score_ (or search.best_score_ in general) – Rob Jun 27 '22 at 14:39
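A quick check of the difference between the two numbers (assuming the fitted grid from the question):

print(grid.best_score_)  # mean cross-validated score of the best alpha (what the search optimizes)
print(grid.score(news_df, y_battles))  # score of the refitted model on the full training data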