0

I all, I am working on a manuscript. I am using lasso regression (binary) in R (glmnet package) to determine which variables best predict outcome in patients admitted to the hospital from COVID-19 symptoms. I had 247 patients, 101 patients ended up dying and 146 were eventually discharged alive. I have 18 possible predictors and I use lasso to predict outcome (1 = died, 0 = discharged alive). Lasso helps me to identify only the most important predictors while being parsimonious.

When I run my code I get two outputs and I can't interpret them:

Here's what I did:

library(glmnet) library(ISLR2) library(caret) library(ggplot2)

#Define Response Variable

y <- Arturo1$OUTCOME

#Define matrix of 18 predictor variables

x<- data.matrix(Arturo1[, c('AGE', 'SEX', 'SPO2_ADMISSION', 'ARDS_ADMISSION', 'DAYS_ON_VE',    'ADMISSION_WARD', 'INVASIVE_VE', 'HOSPITAL_STAY', 'LEUK', 'NEUT', 'LYMPH', 'NLRATIO', 'EOS', 'PLATELETS', 'CRP', 'DDIMER', 'PLRATIO', 'dNLR')])

LASSO REGRESSION.

lm_lasso <-cv.glmnet(x,y, alpha = 1, Standardize = TRUE, nfolds=10, family="binomial")

LAMDA VALUE for the Best cross-validated lambda

best_lambda <- lm_lasso$lambda.min
best_lambda

Fit final model, get its sum of squared residuals and multiple R-squared of the training data

model_cv <- glmnet(x, y, alpha = 1, lambda = best_lambda, standardize = TRUE)

model_cv y_hat_cv <- predict(model_cv, x) ssr_cv <- t(y - y_hat_cv) %*% (y - y_hat_cv) ssr_cv ##Sum of squared residuals rsq_best_lambda <- cor(y, y_hat_cv)^2 rsq_best_lambda ##R2 of the training data

#Coefficients of best lasso model

lm_lasso <-cv.glmnet(x,y, alpha = 1, lamda=best_lambda, Standardize = TRUE, nfolds=10, family="binomial")
W <- as.matrix(coef(lm_lasso))
W
keep_x <- rownames(W)[W!=0]
keep_x <- keep_x[!keep_x == "(Intercept)"]
x <- x[,keep_x]

#find coefficients of best model (s0)

best_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(best_model)

Then my output is this:

s1 (Intercept) 5.56237487 AGE 0.03107880 SEX 0.00000000 SPO2_ADMISSION -0.02365379 ARDS_ADMISSION 0.00000000 DAYS_ON_VE 0.00000000 ADMISSION_WARD 0.00000000 INVASIVE_VE 3.79950907 HOSPITAL_STAY -0.01693193 LEUK 0.00000000 NEUT 0.00000000 LYMPH 0.00000000 NLRATIO 0.00000000 EOS 0.00000000 PLATELETS -1.56045077 CRP 0.00000000 DDIMER 0.00000000 PLRATIO 0.00000000 dNLR 1.14534192

7 x 1 sparse Matrix of class "dgCMatrix" s0 (Intercept) 1.265924008 AGE 0.003592585 SPO2_ADMISSION -0.003396769 INVASIVE_VE 0.731801641 HOSPITAL_STAY -0.006873744 PLATELETS -0.213049905 dNLR 0.162244439

Question 1: I can see the model chose 6 predictors plus intercept, but What is s0 and what is s1?

Question 2: If I clear my screen and I run this all over again, I'm going to get a slightly different lambda and the results above for S0 and s1 are different. Is this normal?

Question 3: I would like to visualize the importance of the different predictor variables. From this weblink https://rforhr.com/lassoregression.html , scrolling down, there is a section on estimating the importance of different predictor variables, using the caret package and varImp function. The visualized plot is something I'd like to do here.

So I do this: varImp(best_model, lambda = 0.008098)

And I get the following:

Overall AGE 0.003592585 SPO2_ADMISSION 0.003396769 INVASIVE_VE 0.731801641 HOSPITAL_STAY 0.006873744 PLATELETS 0.213049905 dNLR 0.162244439

What is this Overall and what is this output? Is this showing me that INVASIVE VE = most important predictor variable in predicting outcome, then PLATELETS are second, dNLR is third, HOSPITAL STAY is 4th, AGE is 5th, SPO2 ADMISSION is 6th most important in estimating the outcome variable?

I try to get the visualization using the following code:

ggplot(varImp(best_model, lambda = 0.008098))

but the plot is blank. There is nothing. What am I doing wrong, here?

Thank you.

Gerry
  • 1
  • 1

0 Answers0