First of all, thank you very much for your interest and time. My question (using R): to predict yvar, I ran a lasso regression, which reduced the set of x variables from 736 to 30.

library(glmnet)
lasso.mod = glmnet(x, y, alpha = 1)
cv.out = cv.glmnet(x, y, alpha = 1)
lasso.bestlam = cv.out$lambda.min
tmp_coef = coef(cv.out, s = lasso.bestlam)

# tmp_coef@i holds zero-based row indices, so add 1 before indexing the names
varnames = data.frame(name = tmp_coef@Dimnames[[1]][tmp_coef@i + 1])
mylist = list(name = tmp_coef@Dimnames[[1]][tmp_coef@i + 1])

Hence, I have the remaining variable names as a data frame and also as a list. How is it possible to create a new data frame which has these remaining 30 variables and their observations in it? In other words: How can I get a subset of my original data which does not contain 737 variables but only 31?

I think this should be quite easy, but I have spent more than two hours on it without success.

Best wishes, Thomas

  • This seems to be a standard column selection problem. Take your old dataframe and select the columns in your list as a vector. E.g. `mtcars[, c("mpg", "cyl")]` will select these two columns from the `mtcars` dataset. – coffeinjunky May 04 '17 at 14:48
  • Searching this site for help with column selection will provide several answers for you. – BLT May 04 '17 at 14:49
  • The problem is that the variables selected by the lasso may change (depending on some other things I will do before running the lasso). Therefore, I do not want to type out 30 variable names by hand every time. But thanks for your time and consideration. – Thomas_Econ May 04 '17 at 15:40

2 Answers

I cannot test this as I do not have the data, but this should do the trick:

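# tmp_coef@i is zero-based, hence the + 1; [-1] drops "(Intercept)", which is not a column of x
varnames <- tmp_coef@Dimnames[[1]][tmp_coef@i + 1][-1]
as.data.frame(cbind(x[, varnames, drop = FALSE], y))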
thothal

The slot tmp_coef@i holds zero-based row indices, so the names of the remaining variables are given by tmp_coef@Dimnames[[1]][tmp_coef@i + 1]. That vector also contains "(Intercept)" as its first item. If you discard it with [-1], you can extract the columns:

x <- as.data.frame(x[, tmp_coef@Dimnames[[1]][tmp_coef@i + 1][-1]])

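Even simpler, you can use the indices in tmp_coef@i directly. They are zero-based and the intercept occupies row 0, so after dropping the first entry they coincide with the column positions in x: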

x <- as.data.frame(x[, tmp_coef@i[-1]])
David Robinson
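
For readers who want to try the column selection end to end, here is a minimal sketch on simulated data. The data, the column names `v1` to `v20`, and the helper variables `beta`, `selected` and `reduced` are assumptions made for illustration and are not part of the original question. Instead of relying on the position of the intercept in `tmp_coef@i`, it converts the sparse coefficient column to a named vector and removes `"(Intercept)"` by name, which may be a little easier to read.

```r
library(glmnet)

# Simulated stand-in for the asker's data (hypothetical): 20 predictors,
# of which only v3 and v7 actually drive y.
set.seed(1)
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y <- 2 * x[, "v3"] - 3 * x[, "v7"] + rnorm(n)

# Cross-validated lasso, as in the question
cv.out <- cv.glmnet(x, y, alpha = 1)
tmp_coef <- coef(cv.out, s = "lambda.min")

# Turn the one-column sparse matrix into a named numeric vector,
# then drop the intercept by name rather than by position.
beta <- as.matrix(tmp_coef)[, 1]
beta <- beta[names(beta) != "(Intercept)"]

# Names of the predictors with a nonzero coefficient
selected <- names(beta)[beta != 0]

# Reduced data set: the selected predictors plus the response.
# drop = FALSE keeps a matrix even if only one column is selected.
reduced <- data.frame(x[, selected, drop = FALSE], y = y)
```

Because `selected` is recomputed from the fitted model, the subset follows along whenever the lasso keeps a different set of variables, so nothing has to be typed by hand.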