
What could be the reason for this?

2 Answers

There is no guarantee that Bayesian optimization will provide optimal hyperparameter values; quoting from the definitive textbook Deep Learning, by Goodfellow, Bengio, and Courville (page 430):

Currently, we cannot unambiguously recommend Bayesian hyperparameter optimization as an established tool for achieving better deep learning results or for obtaining those results with less effort. Bayesian hyperparameter optimization sometimes performs comparably to human experts, sometimes better, but fails catastrophically on other problems. It may be worth trying to see if it works on a particular problem but is not yet sufficiently mature or reliable.

In other words, it is actually just a heuristic (like grid search), and what you report does not necessarily mean that you are doing something wrong or that there is a problem with the procedure that needs correcting...

desertnaut

I would like to extend @desertnaut's excellent answer with some intuition about what can go wrong and how one can improve Bayesian optimization. Bayesian optimization usually relies on some form of distance (and correlation) computation between points (hyperparameter settings). Unfortunately, it is usually close to impossible to impose such a geometric structure on the parameter space. One of the important issues connected to this is that it implicitly imposes a Lipschitz or roughly linear dependency between the optimized value and the hyperparameters. To understand this in more detail, let us have a look at the

Integer(50, 1000, name="estimators")

parameter. Let us inspect how adding 100 estimators could change the behavior of the optimization problem. If we add 100 estimators to 50, we triple the number of estimators and probably increase the expressive power significantly. However, changing from 900 to 1000 should not be nearly as important. So if the optimization process starts with, say, 600 estimators as a first guess, it will notice that changing the number of estimators by approximately 50 does not change much, so it will skip optimizing this hyperparameter (as it assumes a quasi-continuous, linear dependency). This might seriously harm the exploration process.
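
As a quick numeric sketch of this asymmetry (my own illustration, in Python), compare what a distance-based model "sees" for the same +100 step at the two ends of the range:

    # Compare how a step of +100 estimators looks in linear vs. log space.
    import math

    for low, high in [(50, 150), (900, 1000)]:
        linear_dist = high - low                   # what a plain (linear) kernel sees
        ratio = high / low                         # actual relative change in capacity
        log_dist = math.log(high) - math.log(low)  # what a log-transformed kernel sees
        print(f"{low:>4} -> {high:>4}: linear = {linear_dist}, "
              f"ratio = {ratio:.2f}x, log distance = {log_dist:.3f}")

    # Roughly:
    #   50 ->  150: linear = 100, ratio = 3.00x, log distance = 1.099
    #  900 -> 1000: linear = 100, ratio = 1.11x, log distance = 0.105

In linear space both steps look identical to the optimizer, even though their effect on the model is very different; in log space the distances reflect the relative change.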

In order to overcome this issue, it is better to use some sort of log distribution (a log-uniform prior) for this parameter. A similar trick is commonly applied to, e.g., the learning_rate parameter.
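
If you are using scikit-optimize (which the Integer(...) syntax suggests; this is an assumption on my part), a minimal sketch of such a log-uniform search space could look like the following, with a toy objective standing in for your real validation score:

    # Sketch assuming scikit-optimize (skopt); replace the toy objective
    # with your real cross-validated score.
    from skopt import gp_minimize
    from skopt.space import Integer, Real
    from skopt.utils import use_named_args

    space = [
        # With a log-uniform prior the optimizer measures steps in relative
        # (multiplicative) terms, so 50 -> 150 and 300 -> 900 look equally "far".
        Integer(50, 1000, prior="log-uniform", name="estimators"),
        Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    ]

    @use_named_args(space)
    def objective(estimators, learning_rate):
        # Hypothetical stand-in for a real validation loss
        return (learning_rate - 0.01) ** 2 + 1.0 / estimators

    result = gp_minimize(objective, space, n_calls=30, random_state=0)
    print("best parameters found:", result.x)

With the log-uniform priors, the optimizer explores the low end of the estimators range as thoroughly as the high end, instead of treating a move from 900 to 1000 as just as informative as a move from 50 to 150.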

Marcin Możejko