So I need to generate 1000 data sets with 200 observations each in R from this model: model
and use Lasso and Ridge regression on all of them. Then I need to get the beta_j coefficients for Lasso and Ridge. Can anyone help? Thank you in advance!
The setup is as you described in the image:
library(magrittr) # for set_colnames()
library(tidyverse)
library(glmnet)
# mvtnorm is also needed (called via mvtnorm:: below), and furrr later for the parallel run
M <- 9
beta <- c(c(0, 3, 2, 1, 0.5, 0.3),
          rep(0, 10 - 6)) # true coefficients, intercept first
beta <- beta[-1] # drop beta_0: glmnet handles the intercept itself
sigma <- diag(M) + 0.5 - 0.5 * diag(M) # unit variances, 0.5 correlation between all predictors
sigma
N <- 200
G <- 1000
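For a reproducible one-off demonstration below, it can help to fix the seed first (my addition; the parallel run later seeds itself via furrr_options(seed = TRUE)):

set.seed(1)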
Now, to make the X and the right beta:
Xj <- mvtnorm::rmvnorm(n = N, sigma = sigma) %>%
set_colnames(paste0("x_", seq_len(ncol(.))))
# X <- cbind(intercept = 1, Xj) # glmnet contains the intercept
X <- Xj
epsilon <- rnorm(n = N, sd = 0.5)
beta %>% length() # 9
X %>% ncol() # 9, matches beta
y <- drop(tcrossprod(beta, X)) + epsilon # drop() gives the plain length-N vector glmnet expects
y
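As a quick sanity check on the data-generating step (this fit is my own addition, not part of the original answer), an unpenalised least-squares fit on a single dataset should land close to the true beta:

lm(y ~ Xj - 1) %>%
  coef() %>%
  round(2)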
For each dataset, two model estimates have to be found:
list(
lasso =
glmnet::cv.glmnet(
X, y, family = "gaussian",
alpha = 1,
intercept = FALSE
),
ridge =
glmnet::cv.glmnet(
X, y, family = "gaussian",
alpha = 0,
intercept = FALSE
)
) %>%
print() %>%
map_df(. %>% coef() %>% as.matrix() %>% t() %>% as_tibble(), .id = "type")
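Note that coef() on a cv.glmnet object extracts the coefficients at s = "lambda.1se" by default; to use the CV-minimising penalty instead, pass it explicitly (fit here stands for either of the objects above):

coef(fit, s = "lambda.min")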
Now, one could use replicate, but the number of datasets is very large, so we will have to use parallel programming here:
library(furrr)
plan(multisession, workers = 4)
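The worker count is hard-coded above; if you would rather adapt it to the machine, and, say, leave one core free, something like this should work:

plan(multisession, workers = max(1, future::availableCores() - 1))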
seq_len(G) %>%
# seq_len(50) %>%
furrr::future_map_dfr(
~ {
Xj <- mvtnorm::rmvnorm(n = N, sigma = sigma) %>%
set_colnames(paste0("x_", seq_len(ncol(.))))
# X <- cbind(intercept = 1, Xj) # glmnet contains the intercept
X <- Xj
epsilon <- rnorm(n = N, sd = 0.5)
y <- drop(tcrossprod(beta, X)) + epsilon # drop() again, so glmnet gets a plain vector
list(
lasso =
glmnet::cv.glmnet(
X, y, family = "gaussian",
alpha = 1,
intercept = FALSE,
parallel = FALSE
),
ridge =
glmnet::cv.glmnet(
X, y, family = "gaussian",
alpha = 0,
intercept = FALSE,
parallel = FALSE
)
) %>%
# print() %>%
map_df(. %>%
coef() %>%
as.matrix() %>%
t() %>%
as_tibble(), .id = "type") %>%
mutate(rep = .x)
},
.progress = TRUE,
.options = furrr_options(seed = TRUE)
) ->
results
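Each replicate contributes one row of lasso estimates and one row of ridge estimates, so a quick shape check (my addition) is:

results %>% count(type) # expect G = 1000 rows for each type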
This will give a progress bar, and a rep column that ties each set of model estimates to the dataset it came from.
Let us try to summarise these results somehow:
results %>%
glimpse() %>%
pivot_longer(c(`(Intercept)`, starts_with("x_")),
names_to = "parameter", values_to = "estimate") %>%
glimpse() %>%
# ggplot(aes(estimate, group = interaction(type, parameter))) +
ggplot(aes(estimate)) +
geom_vline(data = tibble(true_beta = beta, parameter = paste0("x_", 1:9)) %>%
add_row(true_beta = 0, parameter = "(Intercept)"),
aes(xintercept = true_beta)) +
# geom_density() +
stat_bin(geom = "step", aes(y = after_stat(density))) +
facet_grid(type ~ parameter, scales = "free") +
ggpubr::theme_pubclean()
For each parameter, there is a whole collection of estimates; they are plotted as histograms, with the true values drawn as vertical lines:
The results are quite surprising, to me at least:
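To attach numbers to that impression (this summary is my own addition, not from the original answer):

results %>%
  pivot_longer(c(`(Intercept)`, starts_with("x_")),
               names_to = "parameter", values_to = "estimate") %>%
  group_by(type, parameter) %>%
  summarise(mean = mean(estimate), sd = sd(estimate), .groups = "drop")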
Instead of coef, one can use glmnet::coef.glmnet and provide s = c("lambda.1se", "lambda.min"). Just for fun, here's how the plot would look if both of these hyper-parameter lambdas were used:
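One safe way to get estimates at both penalties is to call coef() once per choice; a minimal sketch, where the helper name both_lambdas is mine:

both_lambdas <- function(fit) {
  c("lambda.1se", "lambda.min") %>%
    purrr::set_names() %>%
    purrr::map_df(
      ~ coef(fit, s = .x) %>% as.matrix() %>% t() %>% as_tibble(),
      .id = "s"
    )
}

Inside the future_map_dfr body, the coef() pipeline would then be swapped for map_df(both_lambdas, .id = "type"), giving an extra s column to facet or colour by.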