
During resampling, max_depth values of 5 and 9 are tried. During the final training, however, a completely different value of 10 is used. I expected that the value which returned the smallest RMSE during tuning would be set for training; instead, a completely different value was chosen.

library("mlr3")
library("paradox")
library("mlr3learners")
library("mlr3tuning")
library("data.table")

set.seed(10)

x1 = 1:100
x2 = 2 * x1
y = x1^2 - x2 + rnorm(100)

data = data.table(
   x1 = x1,
   x2 = x2,
   y = y
)

task = TaskRegr$new("task", backend = data, target = "y")

lrn_xgb = mlr_learners$get("regr.xgboost")

# search space: tune max_depth over [4, 10]
ps = ParamSet$new(
   params = list(
      ParamInt$new(id = "max_depth", lower = 4, upper = 10)
   ))

# AutoTuner: tunes max_depth via random search with an inner 2-fold CV,
# scored by RMSE; n_evals = 1 means only one random configuration is evaluated
at = AutoTuner$new(learner = lrn_xgb,
                   resampling = rsmp("cv", folds = 2),
                   measures = msr("regr.rmse"),
                   tune_ps = ps,
                   terminator = term("evals", n_evals = 1),
                   tuner = tnr("random_search"))

# outer resampling: each outer fold triggers its own tuning run
resampling_outer = rsmp("cv", folds = 2)

rr = resample(task = task, learner = at, resampling = resampling_outer)
#> max_depth = 5
#> max_depth = 9

# tuning on the full data, followed by the final model fit
at$train(task)
#> max_depth = 10

Session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)

Matrix products: default

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] mlr3learners_0.1.3 mlr3tuning_0.1.0   data.table_1.12.2 
[4] paradox_0.1.0      mlr3_0.1.3

loaded via a namespace (and not attached):
 [1] lgr_0.3.3        lattice_0.20-38  mlr3misc_0.1.4  
 [4] digest_0.6.21    crayon_1.3.4     grid_3.6.1      
 [7] R6_2.4.0         backports_1.1.4  magrittr_1.5    
[10] stringi_1.4.3    uuid_0.1-2       Matrix_1.2-17   
[13] checkmate_1.9.4  xgboost_0.90.0.2 tools_3.6.1     
[16] compiler_3.6.1   Metrics_0.1.4

1 Answer


Everything that happens is correct. The point is this: the AutoTuner prepends the training algorithm of xgboost with a tuning step, which finds (optimal? good? well-performing?) hyperparameters, sets them in the learner, and then trains the model through a final call of the training algorithm.

You can envision this as

Data -> [Split-Data] -> [Tune] -(opt.HPs, Data) -> [Train] -> Model

If you want an (only slightly) less ugly looking pic for this, have a look at my lecture at:

https://compstat-lmu.github.io/lecture_i2ml/articles/content.html (see day5, tuning and nested resampling)
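
In code, this pipeline looks roughly like the following. This is only a sketch of what the AutoTuner does for you, written against a newer mlr3tuning/paradox API (TuningInstanceSingleCrit, trm(), paradox::ps()) than the 0.1.0 releases from the session info, so the constructor and argument names differ from the code above:

# [Tune]: search the space with an inner 2-fold CV, scored by RMSE
instance = TuningInstanceSingleCrit$new(
   task = task,
   learner = lrn_xgb,
   resampling = rsmp("cv", folds = 2),
   measure = msr("regr.rmse"),
   search_space = paradox::ps(max_depth = p_int(lower = 4, upper = 10)),
   terminator = trm("evals", n_evals = 1)
)
tnr("random_search")$optimize(instance)

# set the chosen hyperparameters in the learner, then [Train] once on all data
lrn_final = lrn_xgb$clone()
lrn_final$param_set$values = instance$result_learner_param_vals
lrn_final$train(task)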

Now, in your code above, 3 passes of this pipeline happen: 2 in your 2-fold CV, and 1 at the end. In each pass, a tuning call happens, on different data. So there is NO GUARANTEE that the 3 optimal HP configs are the same. The first 2 are samples from the same underlying data distribution and are of the same size, so quite a lot is "the same", but they are still stochastic samples, and results can differ. Especially when there are many HP configs with nearly the same performance as the optimal one, when the data is small, and when the tuner is pretty stochastic. (NB: all of this is true for your example.) For the third tuning run, the underlying data distribution is still the same, but now the training data is even a bit larger (double the size in your case, due to the 2-fold CV). That can also lead to different results.
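
If you want to see all three selected configurations next to each other, you can store the fitted AutoTuners and look at their tuning results. Again only a sketch, assuming a newer mlr3tuning release where extract_inner_tuning_results() and the store_models argument of resample() are available (they are not in 0.1.0):

rr = resample(task, at, resampling_outer, store_models = TRUE)
extract_inner_tuning_results(rr)   # max_depth chosen in each outer fold
at$train(task)
at$tuning_result                   # max_depth chosen on the full training data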

In general, you can check for at least roughly similar tuning results, as you did above, and start to "worry" / inspect / use your human learning instrument (brain) to figure out why the tuning is actually "unstable". But in your case, the data is so small, and the experiment is so much of a "toy experiment", that I don't think it makes sense to ponder this here. Why it's technically not a bug I explained above.

Here is another hopefully helpful analogy: forget the AutoTuner and run exactly the same code with a simple linear regression. You run a 2-fold CV with it, and you fit it on the complete data. 3 "beta" parameter vectors are created for the linear model. Do you expect them to all be the same? No. Would you worry if they were all super different? Potentially.
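
For example, sticking with the data from the question (an illustrative sketch; x2 is dropped because it is perfectly collinear with x1):

idx = sample(nrow(data), nrow(data) / 2)
coef(lm(y ~ x1, data = data[idx]))    # first CV "fold"
coef(lm(y ~ x1, data = data[-idx]))   # second CV "fold"
coef(lm(y ~ x1, data = data))         # complete data

The three coefficient vectors will all be slightly different, just like the three selected max_depth values above.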

My last example and your code are very much related. My last example I would call "1st-level learning": we optimize the risk function of the linear model, numerically. Tuning is "2nd-level learning": it still optimizes parameters, call them hyperparameters or 2nd-level parameters, but it optimizes a different "risk", the cross-validated error, and it uses other optimization techniques, maybe random search, maybe Bayesian optimization. On an abstract level, though, both techniques are very similar.

This comparison helped me a lot as a student, and it's also the reason why mlr looks the way it does, to a certain degree, regarding the AutoTuner.