I am working with R to build a predictive model using binary logistic regression with the Lasso penalty.
The original data set consists of 63147 observations and 22 variables, with 3% of the observations coming from $G_1$ and 97% coming from $G_2$. Because this is very unbalanced, I have drawn a sample of size 5000 in which 30% of the observations come from $G_1$ and 70% come from $G_2$.
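For illustration, the subsampling step could look roughly like the sketch below. It uses simulated group labels rather than the real data, and the names `full.data`, `group`, and `sub.data` are hypothetical.

```r
# Simulated stand-in for the real data: a two-level group label with
# roughly 3% of rows in G1 (all object names here are hypothetical).
set.seed(123)
n.total <- 63147
full.data <- data.frame(
  id    = seq_len(n.total),
  group = ifelse(runif(n.total) < 0.03, "G1", "G2")
)

n.sample <- 5000
n.g1 <- round(0.30 * n.sample)  # 1500 rows from G1
n.g2 <- n.sample - n.g1         # 3500 rows from G2

# Sample row indices within each group, without replacement.
idx.g1 <- sample(which(full.data$group == "G1"), n.g1)
idx.g2 <- sample(which(full.data$group == "G2"), n.g2)

sub.data <- full.data[c(idx.g1, idx.g2), ]
table(sub.data$group)
```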
I have tried fitting two models: the classical binary logistic regression (BLR) using the `glm` function in R, and binary logistic regression with the Lasso penalty using the `glmnet` package.
I have scaled my data using the `scale` function in R because the variables were measured in different units.
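A minimal sketch of this kind of scaling, on made-up matrices (`train.x` and `test.x` are hypothetical names), is shown below. It assumes the test data should be scaled with the centers and standard deviations of the training data. Note also that `glmnet` standardizes predictors internally by default (`standardize = TRUE`), so scaling by hand may be redundant for the Lasso fit.

```r
# Made-up predictor matrices standing in for the real train/test split.
set.seed(1)
train.x <- matrix(rnorm(40, mean = 50, sd = 10), ncol = 4)
test.x  <- matrix(rnorm(20, mean = 50, sd = 10), ncol = 4)

# Center and scale the training columns to mean 0 and sd 1.
train.scaled <- scale(train.x)

# Reuse the training centers and scales for the test data.
test.scaled <- scale(test.x,
                     center = attr(train.scaled, "scaled:center"),
                     scale  = attr(train.scaled, "scaled:scale"))

round(colMeans(train.scaled), 10)
```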
When fitting the BLR, warnings occurred, as can be seen below:
> BLR.Model.SubPop <- train(y~., data = Train.Data.SubPop, method = "glm", family = "binomial")
There were 47 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: algorithm did not converge
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
5: glm.fit: algorithm did not converge
6: glm.fit: fitted probabilities numerically 0 or 1 occurred
7: glm.fit: algorithm did not converge
From the research I have done, this is due to separation in the data.
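As a small illustration of what separation means, the toy example below (made-up data, not the data from this question) has a predictor that perfectly splits the two classes. Under that assumption the maximum likelihood estimate does not exist, the slope grows without bound, and `glm` reports warnings of the kind listed above.

```r
# Toy data: every x <= 3 has y = 0 and every x >= 4 has y = 1,
# so x separates the classes perfectly.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

# Collect the warning messages instead of printing them.
caught <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    caught <<- c(caught, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

caught     # warning messages raised by glm.fit
coef(fit)  # very large coefficients are a typical symptom of separation
```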
I then opted to use BLR with the Lasso, using the `cv.glmnet()` function to find `lambda.min` and `lambda.1se`.
Below are the coefficients for these two values of lambda:
> cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "class")
> plot(cv.lasso)
> cv.lasso$lambda.min
[1] 5.575006e-05
> cv.lasso$lambda.1se
[1] 0.0001173485
> coef(cv.lasso,cv.lasso$lambda.min)[,1]
(Intercept) X1 X2 X3 X4
-94.7714913 0 0.17288133 -0.28371818 0.03050103
X5 X6 X7 X8 X9
0.02482283 0 0 -0.26218308 0
X10 X11 X12 X13 X14
0 -2.00853016 0 0 0
X15 X16 X17 X18 X19
-0.01768456 0.56538543 0 0 0.54166489
X20 X21
30.10519005 0.18198277
> coef(cv.lasso,cv.lasso$lambda.1se)[,1]
(Intercept) X1 X2 X3 X4
-71.93503132 0 0.0644656 -0.287559336 0.001068958
X5 X6 X7 X8 X9
0.017905135 0 0 0 0
X10 X11 X12 X13 X14
0 -1.239442745 0 0 0
X15 X16 X17 X18 X19
-0.083885831 0 0 0 0.19158206
X20 X21
22.99517148 0.052543166
When I tried to fit the Lasso using `lambda.min` and `lambda.1se`, I encountered the warning below.
> lasso.model.1se <- glmnet(x, y, alpha = 1, family = "binomial", lambda = cv.lasso$lambda.1se)
Warning messages:
1: from glmnet Fortran code (error code -1); Convergence for 1th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned
2: In getcoef(fit, nvars, nx, vnames) :
an empty model has been returned; probably a convergence issue
However, the code below runs:
LASSO.prob <- cv.lasso %>% predict(newx=x.test,type = "response")
I am not sure which lambda `cv.lasso` is working with: is it `lambda.min` or `lambda.1se`? Why am I getting these warnings?
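For reference, the sketch below uses simulated data (the names `x.sim`, `y.sim`, and `cv.fit` are hypothetical) to show how the lambda can be chosen explicitly through the `s` argument. As far as the `glmnet` documentation goes, `predict()` on a `cv.glmnet` object defaults to `s = "lambda.1se"`, and the documentation advises against passing a single value to `lambda` in `glmnet()`, suggesting `predict()` or `coef()` with `s` on the fitted path instead.

```r
library(glmnet)

# Simulated binary outcome driven by the first two of six predictors.
set.seed(42)
n <- 300
p <- 6
x.sim <- matrix(rnorm(n * p), n, p)
y.sim <- rbinom(n, 1, plogis(x.sim[, 1] - 0.5 * x.sim[, 2]))

cv.fit <- cv.glmnet(x.sim, y.sim, alpha = 1, family = "binomial",
                    type.measure = "class")

# Predictions without s, and with each lambda named explicitly.
p.default <- predict(cv.fit, newx = x.sim, type = "response")
p.1se <- predict(cv.fit, newx = x.sim, s = "lambda.1se", type = "response")
p.min <- predict(cv.fit, newx = x.sim, s = "lambda.min", type = "response")

# Check whether the default matches the lambda.1se predictions.
all.equal(as.numeric(p.default), as.numeric(p.1se))

# Coefficients at either lambda come from the path already fitted by
# cv.glmnet, so no separate single-lambda glmnet() call is needed.
coef(cv.fit, s = "lambda.1se")
coef(cv.fit, s = "lambda.min")
```

Naming `s` in every `predict()` and `coef()` call removes the ambiguity about which lambda is in use.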
I used the link below as a reference to build my code: