When tuning the elastic net hyperparameter lambda using the lambda_search option, the algorithm may not pick the value of lambda from the search grid that minimizes deviance on the validation sample. This happens even when we set early_stopping = FALSE, i.e., when one would expect H2O to evaluate all values of lambda in the grid.
This can be checked by first tuning lambda with lambda_search = TRUE in h2o.glm(), then running a grid search over the same values of lambda with h2o.grid(), and comparing the selected lambda and the validation deviance of the two approaches. See the R code below.
The issue is closely related to the one pointed out here and mentioned here. What this question adds is evidence that the value of lambda selected by lambda_search need not be the one that minimizes validation deviance. That is, the problem can be more severe than H2O computing up to the best lambda and then exiting, as stated in the comments here. The issue occurred for me when tuning on a single validation sample in a Tweedie GLM with log link; I am not sure how specific it is to this setting.
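For reference, the validation deviance I compare is the Tweedie unit deviance. A minimal sketch in base R, assuming a variance power strictly between 1 and 2; the function name tweedie_unit_deviance and the toy values are my own illustration and not part of H2O or the tweedie package, though the result is intended to match what tweedie.dev() computes in that range:

```r
# Tweedie unit deviance for 1 < power < 2 (hypothetical helper, base R only).
# y: observed values (>= 0), mu: predicted means (> 0).
tweedie_unit_deviance <- function(y, mu, power) {
  stopifnot(power > 1, power < 2, all(y >= 0), all(mu > 0))
  2 * (y^(2 - power) / ((1 - power) * (2 - power)) -
         y * mu^(1 - power) / (1 - power) +
         mu^(2 - power) / (2 - power))
}

# Toy values for illustration only, not taken from the simulation.
y_toy   <- c(0, 1, 2.5)
mu_toy  <- c(0.5, 1, 2)
dev_toy <- tweedie_unit_deviance(y_toy, mu_toy, power = 1.5)
mean(dev_toy)  # mean deviance over the toy sample
```

The deviance is zero when the prediction equals the observation and positive otherwise, so a lower mean on the validation sample indicates a better fit.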
Based on these results, I would tend to always use grid search to determine lambda. Is this appropriate? Alternatively, is there some option in h2o.glm() that addresses the issue with lambda_search?
rm(list = ls())
library(h2o)
library(tweedie)
library(tidyverse)
# Configuration -----------------------------------------------------------
# DGP:
n = 1000
k = 10
phi = 1
const = 0
bet = seq(-1, 1, length.out = k)
power = 1.5
# algorithm
alpha = 0.5
# Generate some data ------------------------------------------------------
set.seed(42)
x = rnorm(n * k) %>%
  matrix(nrow = n, dimnames = list(NULL, paste0("x", seq(1, k))))
mu = as.numeric(exp(const + x %*% bet))
dat = x %>%
  as_tibble() %>%
  mutate(mu = mu,
         y = rtweedie(n,
                      mu = mu,
                      phi = phi,
                      power = power),
         id = row_number(),
         sample = case_when(
           id <= n / 2 ~ "train",
           TRUE ~ "valid"))
# Initialize H2O ----------------------------------------------------------
h2o.init()
df_h2o_train = dat %>%
  filter(sample == "train") %>%
  as.h2o()
df_h2o_valid = dat %>%
  filter(sample == "valid") %>%
  as.h2o()
# Tune lambda -------------------------------------------------------------
# 1. Lambda search
glm_warmstart = h2o.glm(
  x = paste0("x", seq(1, k)),
  y = "y",
  family = "tweedie",
  tweedie_variance_power = power,
  tweedie_link_power = 0,
  training_frame = df_h2o_train,
  validation_frame = df_h2o_valid,
  alpha = alpha,
  lambda_search = TRUE,
  early_stopping = FALSE
)
lambda_warmstart = glm_warmstart@model$lambda_best
print(lambda_warmstart) # 0.1501327
# 2. Grid search
hyper_params = list(lambda = glm_warmstart@model$scoring_history$lambda %>%
                      h2o.asnumeric())
grid_search = h2o.grid("glm",
                       hyper_params = hyper_params,
                       x = paste0("x", seq(1, k)),
                       y = "y",
                       family = "tweedie",
                       tweedie_variance_power = power,
                       tweedie_link_power = 0,
                       training_frame = df_h2o_train,
                       validation_frame = df_h2o_valid,
                       alpha = alpha,
                       lambda_search = FALSE)
lambda_grid_search = grid_search@summary_table %>%
  as_tibble() %>%
  head(1) %>%
  pull(lambda) %>%
  stringr::str_sub(2, -2) %>%
  as.numeric()
print(lambda_grid_search) # 0.013
glm_grid_search = h2o.glm(
  x = paste0("x", seq(1, k)),
  y = "y",
  family = "tweedie",
  tweedie_variance_power = power,
  tweedie_link_power = 0,
  training_frame = df_h2o_train,
  alpha = alpha,
  lambda = lambda_grid_search)
# Compare validation deviance ---------------------------------------------
dat %>%
  filter(sample == "valid") %>%
  mutate(pred_warmstart = as.vector(h2o.predict(glm_warmstart,
                                                newdata = df_h2o_valid)),
         pred_grid_search = as.vector(h2o.predict(glm_grid_search,
                                                  newdata = df_h2o_valid)),
         deviance_warmstart = tweedie.dev(y, pred_warmstart, power),
         deviance_grid_search = tweedie.dev(y, pred_grid_search, power)) %>%
  summarise(
    mean_deviance_warmstart = mean(deviance_warmstart), # 1.16
    mean_deviance_grid_search = mean(deviance_grid_search) # 1.08
  )
# Close -------------------------------------------------------------------
h2o.shutdown(prompt = FALSE)
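
The comparison made in the script above can be condensed into a small check: given a table of lambda values with their validation deviances, find the minimizer and see whether the lambda reported by the search coincides with it. This is only a sketch in base R; the helper name check_lambda_choice, the column names lambda and deviance_valid, and the toy numbers are assumptions for illustration and are not H2O output.

```r
# path: data frame with columns `lambda` and `deviance_valid` (assumed names).
# lambda_chosen: the lambda reported by the search procedure.
check_lambda_choice <- function(path, lambda_chosen, tol = 1e-8) {
  # Row with the smallest validation deviance
  best <- path[which.min(path$deviance_valid), ]
  # Row of the path closest to the reported lambda
  chosen <- path[which.min(abs(path$lambda - lambda_chosen)), ]
  list(
    lambda_argmin    = best$lambda,
    deviance_argmin  = best$deviance_valid,
    deviance_chosen  = chosen$deviance_valid,
    chosen_is_argmin = abs(chosen$lambda - best$lambda) <= tol
  )
}

# Hypothetical toy path, for illustration only.
toy_path <- data.frame(
  lambda         = c(1, 0.1, 0.01, 0.001),
  deviance_valid = c(3, 2, 1, 1.5)
)
res <- check_lambda_choice(toy_path, lambda_chosen = 0.1)
res$chosen_is_argmin
```

If chosen_is_argmin is FALSE, the search returned a lambda that some other value on the same path beats on the validation sample, which is the behaviour this question describes.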