
So I need to generate 1000 data sets with 200 observations in R from this model: [model specification given as an image in the original question]

and use Lasso and Ridge regression on all of them. Then I need to get the beta_j coefficients for Lasso and Ridge. Can anyone help? Thank you in advance!

  • If you do not supply any code attempts along with code-specific questions you most probably will not get any responses. – SteveM Jun 12 '22 at 15:44
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Jun 13 '22 at 13:45

1 Answer


The setup is as you described in the image:

library(magrittr)
library(tidyverse)
library(glmnet)
M <- 9                                  # number of predictors
beta <- c(c(0, 3, 2, 1, 0.5, 0.3),
          rep(0, 10 - 6))
beta <- beta[-1] # glmnet contains the intercept
sigma <- diag(M) + 0.5 - 0.5 * diag(M)  # 1 on the diagonal, 0.5 off-diagonal
sigma
N <- 200                                # observations per dataset
G <- 1000                               # number of datasets

Now, generate the design matrix X and the response y:

Xj <- mvtnorm::rmvnorm(n = N, sigma = sigma) %>%
  set_colnames(paste0("x_", seq_len(ncol(.))))
# X <- cbind(intercept = 1, Xj) # glmnet contains the intercept
X <- Xj
epsilon <- rnorm(n = N, sd = 0.5)
beta %>% length # sanity check: number of coefficients...
X %>% ncol()    # ...matches the number of columns in X
y <- tcrossprod(beta, X) + epsilon # X %*% beta plus noise, returned as a 1 x N matrix
y

For each dataset, the two model fits have to be found:

list(
  lasso =
    glmnet::cv.glmnet(
      X, y, family = "gaussian",
      alpha = 1,
      intercept = FALSE
    ),
  ridge =
    glmnet::cv.glmnet(
      X, y, family = "gaussian",
      alpha = 0,
      intercept = FALSE
    )
) %>%
  print() %>%
  map_df(. %>% coef() %>% as.matrix() %>% t() %>% as_tibble(), .id = "type")

Now, one could use replicate, but the number of datasets is very large, so we will use parallel programming here...

library(furrr)
plan(multisession, workers = 4)

seq_len(G) %>%
# seq_len(50) %>%
  furrr::future_map_dfr(
    ~ {
      Xj <- mvtnorm::rmvnorm(n = N, sigma = sigma) %>%
        set_colnames(paste0("x_", seq_len(ncol(.))))
      # X <- cbind(intercept = 1, Xj) # glmnet contains the intercept
      X <- Xj
      epsilon <- rnorm(n = N, sd = 0.5)
      y <- tcrossprod(beta, X) + epsilon
      list(
        lasso =
          glmnet::cv.glmnet(
            X, y, family = "gaussian",
            alpha = 1,
            intercept = FALSE,
            parallel = FALSE
          ),
        ridge =
          glmnet::cv.glmnet(
            X, y, family = "gaussian",
            alpha = 0,
            intercept = FALSE,
            parallel = FALSE
          )
      ) %>%
        # print() %>%
        map_df(. %>%
                 coef() %>%
                 as.matrix() %>%
                 t() %>%
                 as_tibble(), .id = "type") %>% 
        mutate(rep = .x)
    },
    .progress = TRUE,
    .options = furrr_options(seed = TRUE)
  ) ->
  results

This will give a progress bar, and a rep column that records which dataset each set of model estimates belongs to.
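
As a quick sanity check one could inspect the combined tibble (a sketch, assuming the pipeline above has finished and results holds its output):

# each type (lasso/ridge) should contribute one row per dataset, i.e. G rows each
results %>% count(type)
# coefficient estimates from the first simulated dataset only
results %>% filter(rep == 1)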

Let us try to summarise these results somehow:


results %>% 
  glimpse() %>% 
  pivot_longer(c(`(Intercept)`, starts_with("x_")), 
               names_to = "parameter", values_to = "estimate") %>%
  glimpse() %>% 
  # ggplot(aes(estimate, group = interaction(type, parameter))) + 
  ggplot(aes(estimate)) + 
  
  geom_vline(data = tibble(true_beta = beta, parameter = paste0("x_", 1:9)) %>% 
               add_row(true_beta = 0, parameter = "(Intercept)"),
             aes(xintercept = true_beta)) + 
  
  # geom_density() +
  stat_bin(geom = "step", aes(y = after_stat(density))) +
  facet_grid(type ~ parameter, scales = "free") + 
    
  ggpubr::theme_pubclean()

For each parameter there is a collection of estimates; they are plotted as histograms, and the true values are drawn as vertical lines:

The results are quite surprising, to me at least:

[Figure: histograms of the estimates from the datasets, with the true values superimposed]
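
As a complement to the plot, the estimates can also be summarised numerically; a minimal sketch, assuming the results tibble from the pipeline above (prop_zero is mainly of interest for the lasso, which can shrink coefficients exactly to zero):

results %>%
  pivot_longer(c(`(Intercept)`, starts_with("x_")),
               names_to = "parameter", values_to = "estimate") %>%
  group_by(type, parameter) %>%
  summarise(mean_estimate = mean(estimate),
            sd_estimate   = sd(estimate),
            # fraction of datasets in which the coefficient is estimated as exactly 0
            prop_zero     = mean(estimate == 0),
            .groups = "drop")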

Instead of coef one can use glmnet::coef.glmnet, and provide s = c("lambda.1se", "lambda.min"). Just for fun, here's how the plot would look if both of these hyper-parameter lambdas were used:
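
A minimal sketch of that extraction, assuming the list of cv.glmnet fits from the single-dataset example above has been stored in a variable called fits (a name introduced here for illustration); coef() is called once per choice of s:

c("lambda.1se", "lambda.min") %>%
  set_names() %>%
  map_df(function(s)
    # same coefficient-tidying pattern as above, but at the chosen lambda
    fits %>%
      map_df(. %>% coef(s = s) %>% as.matrix() %>% t() %>% as_tibble(),
             .id = "type"),
    .id = "s")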

Mossa