I have many continuous independent variables and a binary (dummy) dependent variable in a data set of individuals observed over several years. I want to perform feature selection using a logistic random effects lasso or a logistic fixed effects lasso. However, by default glmnet treats my data as cross-sectional rather than as panel data, so it fits a pooled logistic lasso instead of the random effects or fixed effects model I want.

Therefore, in the example code below, I want to tell R that the data are a panel, that ID identifies the individuals (cross-sectional units), and that year gives the years in which each ID is observed. As written, the code pools all individuals, and I even get coefficients for ID (and year) in this logistic lasso estimation. How can I estimate a logistic random effects lasso or logistic fixed effects lasso model in R?

library(glmnet)

# Toy panel: 7 individuals (ID) observed in 3 years each, with 3 covariates
df=cbind(c(1,546,2,56,6,73,4234,436,647,567,87,2,5,76,5,456,6756,6,132,78,32),c(2,3546,26,568,76,873,234,36,67,57,887,29,50,736,51,56,676,62,32,782,322),10:30)
year=rep(1:3, times=7)
ID=rep(1:7, each=3)
x=as.matrix(cbind(ID,year,df))
y=as.matrix(rep(c(0,1), each=18)[1:21])   # binary outcome: 18 zeros, then 3 ones

# Pooled logistic lasso: ID and year enter as ordinary regressors
fit=glmnet(x,y,alpha=1,family="binomial")
lambdamin=min(fit$lambda)
predict(fit,s=lambdamin,type="coefficients")
                        1
(Intercept) -8.309211e+01
ID           1.281220e+01
year         .           
            -2.339904e-04
             .           
             .           

1 Answer

For lasso + FE, you can first demean both sides of your regression within each individual, following the logic given e.g. here, and then run the lasso via glmnet.
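A minimal sketch of the demean-then-lasso idea, assuming a continuous outcome (with a binary outcome, demeaning y means the model is no longer a logistic one). The simulated data, the column names x1 to x3 and the helper demean() are made up for illustration; only ave(), glmnet() and coef() are existing functions.

```r
library(glmnet)

# Simulated toy panel: 7 individuals observed in 3 years each (hypothetical data)
set.seed(1)
ID   <- rep(1:7, each = 3)
year <- rep(1:3, times = 7)
X    <- matrix(rnorm(21 * 3), ncol = 3,
               dimnames = list(NULL, c("x1", "x2", "x3")))
# Continuous outcome with an individual-specific effect
y    <- 0.5 * X[, "x1"] + rnorm(7)[ID] + rnorm(21)

# Within transformation: subtract each individual's own mean
demean <- function(v, g) v - ave(v, g)
X_w <- apply(X, 2, demean, g = ID)
y_w <- demean(y, ID)

# Lasso on the demeaned data; ID and year are not passed as regressors,
# and the intercept is dropped because both sides are mean-zero
fit <- glmnet(X_w, y_w, alpha = 1, family = "gaussian", intercept = FALSE)
coef(fit, s = min(fit$lambda))
```

Because the individual means are removed before the fit, the individual effects are absorbed by the transformation and no coefficient for ID is estimated.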

Lasso + random effects is a more complicated beast mathematically, and it is not supported out of the box by glmnet. There is a package for fitting a mixed-model lasso here, but I haven't tried it.

Otto Kässi
  • That seems reasonable. For FE I need to demean with respect to each individual. But after demeaning, do I also need to standardize the resulting independent variables over the whole sample (center and divide by the standard deviation), as is typically done before a lasso? – Luuk van Gasteren May 01 '19 at 15:00
  • I would first demean and then standardise because if you first standardise and demean thereafter, the stuff going into lasso will no longer be mean-zero and 1-std. This might be a matter of taste though. – Otto Kässi May 01 '19 at 15:44
  • Yes, that is also what I was thinking. – Luuk van Gasteren May 01 '19 at 15:48
  • Another question: when you within-transform the binary y-variable, the transformed variable (y_it - y_mean_i) can take five values: -0.67, -0.33, 0, 0.33 or 0.67. Does that mean the binomial logistic model would have to become a multinomial model? Also, the interpretation of the model's coefficients then becomes quite strange, as the y-values are now deviations from the individual average. – Luuk van Gasteren May 02 '19 at 10:18
  • I had missed that your outcome variable is binary, apologies! You could try and estimate a linear probability model (https://en.wikipedia.org/wiki/Linear_probability_model), which most likely will approximate logistic regression very well – Otto Kässi May 02 '19 at 14:27
  • The answer fails to mention that you will need to adjust the standard errors if using the demeaning approach. – hrrrrrr5602 Dec 05 '20 at 17:53
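
Two points from the comments can be checked numerically. The sketch below uses simulated data with the question's layout (7 individuals, 3 years); the helpers demean() and z() are hypothetical names introduced only for this illustration.

```r
# Simulated toy panel: 7 individuals, 3 years each (hypothetical data)
set.seed(42)
ID <- rep(1:7, each = 3)
x  <- rnorm(21, mean = 5, sd = 2)

demean <- function(v, g) v - ave(v, g)       # within transformation
z      <- function(v) (v - mean(v)) / sd(v)  # standardise over the whole sample

demean_then_std <- z(demean(x, ID))
std_then_demean <- demean(z(x), ID)

# Demeaning first, then standardising: mean 0 and sd 1 by construction
c(mean = mean(demean_then_std), sd = sd(demean_then_std))
# Standardising first, then demeaning: the sd is generally no longer 1
c(mean = mean(std_then_demean), sd = sd(std_then_demean))

# Within-transformed binary outcome: enumerate every 0/1 path over 3 years
paths <- as.matrix(expand.grid(y1 = 0:1, y2 = 0:1, y3 = 0:1))
vals  <- sort(unique(round(as.vector(paths - rowMeans(paths)), 2)))
vals  # five values: -0.67, -0.33, 0, 0.33, 0.67
```

Note that glmnet standardises the columns of x internally by default (standardize = TRUE), so passing demeaned columns to it already gives the demean-then-standardise order.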