
I'm trying to understand what I'm supposed to do here. I already applied LASSO and ridge regression, found the optimal lambda, and refitted the models, but I don't understand what I'm supposed to do after that.

QUESTION:

"For the Diabetes data set (uploaded to Moodle), we wish to use the 10 features (X variables) to predict prog (Y), a quantitative assessment of disease progression one year after baseline. Variable prog is the last column in the data. Before you fit ridge regression and LASSO, do not forget to standardize all the X variables so that they would be on the same scale. Use ridge regression and LASSO to predict prog. In both regressions, choose the optimal lambda using cross-validation. The optimal lambda will correspond to the minimum CV error. For the optimal lambda, refit the ridge and LASSO models. Run bootstrap with 1000 bootstrap replications in order to obtain standard errors (SE) for the estimates of regression coefficients. For each bootstrap replication, you'll have to refit a ridge and LASSO model and aggregate the estimates of regression coefficients. Then, the estimates of SEs of regression coefficients will be the SDs of bootstrap estimates."

Phil
Maisaa

1 Answer


Unfortunately I don't have enough reputation to comment, so here's an answer.

The standard error of a sample statistic is the standard deviation of its sampling distribution. By bootstrapping, you're generating an approximate sampling distribution for your estimate, so the standard error is simply the standard deviation of those bootstrap estimates.

Since you've done cross-validation, I assume you're familiar with resampling. Bootstrapping and cross-validation are both resampling methods, but bootstrapping samples with replacement whereas cross-validation samples without replacement (this is a very simplified take; naturally they have different applications). You would refit your models on the newly generated samples many times over and record the estimated coefficients for each one. The standard deviation of those estimates is the standard error.

Here's an example using the mean, with generated data, since you haven't provided code or data for your own example.

set.seed(360)
x <- rnorm(1000)
xhat <- mean(x)

## we'll do 1000 replicates
B <- 1000
## here we generate 1000 samples and take the mean of each one of those samples
xhat_star <- apply(matrix(sample(x, size = B * length(x), replace = TRUE), nrow = B), 1, mean)
standard_error <- sd(xhat_star)
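The same idea carries over to your regression problem: resample rows, refit at your CV-chosen lambda, and take column-wise SDs of the coefficient estimates. Here is a sketch of that loop. So that it runs in base R without assuming a particular package, it uses the closed-form ridge solution beta = (X'X + lambda*I)^(-1) X'y on made-up stand-in data; in your case you would instead refit your own ridge/LASSO model (e.g. with glmnet) inside the loop, at your optimal lambda, on the Diabetes data.

```r
set.seed(360)
n <- 100; p <- 10                            # stand-in dimensions (10 X variables)
X <- scale(matrix(rnorm(n * p), nrow = n))   # standardized predictors
y <- rnorm(n)                                # stand-in for prog
lambda <- 0.5                                # pretend this is the CV-optimal lambda

## closed-form ridge estimate: (X'X + lambda*I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

B <- 1000
coef_star <- matrix(NA_real_, nrow = B, ncol = p)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)           # resample rows with replacement
  coef_star[b, ] <- ridge_coef(X[idx, , drop = FALSE], y[idx], lambda)
}

## bootstrap SE of each coefficient = SD of its B bootstrap estimates
se_hat <- apply(coef_star, 2, sd)
```

The key point is that lambda stays fixed at the value you chose by cross-validation; only the rows are resampled and the model refit on each bootstrap sample.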

That being said, I'm not sure how meaningful standard errors obtained this way are for penalized regression: the penalty has already biased the estimates in exchange for reduced variance, so a bootstrap SE would likely be a misleading measure of precision. See page 18 of the penalized package documentation for a note on standard errors in penalized regression.

TrainingPizza
    There's no reason to worry about not having the reputation to comment if you're able to offer a comprehensive answer like this. The concern with commenting is usually people instead posting answers that make no attempt to (directly) provide an answer, such as "I have this problem to", or "Can you provide more detail?" or "Did you review this documentation?". You've obviously done far more than that, and this is a welcome contribution. – Jeremy Caney Nov 27 '21 at 01:17
  • Noted, thank you! – TrainingPizza Nov 29 '21 at 17:00