I have a dataset with 5000 observations (rows) and 40 variables (columns): 25 categorical and 15 continuous. I want to fit a regression model that predicts a continuous variable from both categorical and continuous predictors. I would also like to do feature selection with the lasso (glmnet() from the glmnet package), so that the model uses only the predictors the lasso identifies as important rather than all of them.
My question is: how can the lasso work with categorical variables? I have to convert my data frame to a matrix because glmnet() expects its input as a matrix. When I convert it, the class of every column changes to character. But I need some columns to stay categorical and some to stay continuous. How should I fix this problem?
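To make the conversion problem concrete, here is a minimal base-R sketch. The data frame `df` and its column names are hypothetical and only illustrate the coercion; they are not part of my real data.

```r
# Small mixed-type data frame (hypothetical values, for illustration only)
df <- data.frame(
  num = c(1.5, 2.5, 3.5),
  cat = factor(c("x", "y", "x"))
)

# A matrix can hold only one type, so as.matrix() falls back to character
# as soon as one column is non-numeric
m <- as.matrix(df)
typeof(m)  # "character"

# data.matrix() stays numeric, but it replaces factor levels with integer
# codes, which imposes an arbitrary ordering on the categories
dm <- data.matrix(df)
is.numeric(dm)
```

So neither conversion gives a matrix in which the categorical columns are still treated as categories.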
In other words, how can I fit a regression model, with the lasso as feature selection, on data that has both categorical and continuous variables, in order to predict a continuous variable?
I created a toy dataset:
a <- sample(1000:1000000, 60, replace = TRUE)
b <- sample(50000:100000000, 60, replace = TRUE)
c <- sample(1:90, 60, replace = TRUE)
d <- c("accident" , "injury" , "surgical" , "poison")
d <- rep(d , 15 )
e <- paste(letters[1:6] , "#" , sep="")
e <- rep(e, 10)
# Build the data frame directly: cbind() would coerce every column to character
data.toy <- data.frame(a, b, c, d, e, stringsAsFactors = TRUE)
head(data.toy)
str(data.toy)  # a, b, c are integer; d and e are factors
Variables a, b, and c are continuous, and d and e are categorical. The four predictors are a, c, d, and e, and the response is column b, which is continuous. Please use this toy data to help me with my problem.
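To show the workflow I have in mind, here is a minimal sketch of one common approach, not necessarily the right one: expand the factors into dummy columns with model.matrix() and pass the resulting numeric matrix to glmnet. The object names (`toy`, `x`, `y`, `cv_fit`) and the seed are my own assumptions. The sketch rebuilds the toy data so that it runs on its own, and the glmnet step runs only if the package is installed.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Self-contained copy of the toy data, with d and e stored as factors
toy <- data.frame(
  a = sample(1000:1000000, 60, replace = TRUE),
  b = sample(50000:100000000, 60, replace = TRUE),
  c = sample(1:90, 60, replace = TRUE),
  d = factor(rep(c("accident", "injury", "surgical", "poison"), 15)),
  e = factor(rep(paste0(letters[1:6], "#"), 10))
)

# model.matrix() expands each factor into 0/1 dummy columns and leaves the
# numeric columns unchanged; [, -1] drops the intercept column
x <- model.matrix(b ~ a + c + d + e, data = toy)[, -1]
y <- toy$b

dim(x)
colnames(x)

if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 1 selects the lasso penalty; cv.glmnet chooses lambda by cross-validation
  cv_fit <- glmnet::cv.glmnet(x, y, alpha = 1)
  print(coef(cv_fit, s = "lambda.min"))  # coefficients shrunk to zero are dropped predictors
}
```

One caveat with this sketch: the lasso penalizes each dummy column separately, so it can keep some levels of a factor and drop others instead of selecting or dropping the categorical variable as a whole.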
Any little help would be greatly appreciated.