I have a dataset on which I want to run the lasso for feature elimination. I am new to R and am following an online guide. The predictors are stored in a data frame; the target has been removed from it and is stored in its own single-column data frame. This is a regression problem, and the target is numeric. Here is the code I am trying to run:

library(glmnet)

lasso_model <- cv.glmnet(
                  x = as.matrix(train),
                  y = train_target,
                  alpha = 1)

Here is information about the dataset:

'data.frame':   9798 obs. of  55 variables:
$ acres: num  0.186 2.991 0.144 0.218 0.173 ...
$ above: int  1754 3030 1531 834 1022 1528 768 1184 2026 3176 ...
$ basement: int  0 1811 500 440 0 476 0 0 732 0 ...
$ baths: Factor w/ 7 levels "0","1","2","3",..: 3 4 3 3 2 3 2 2 3 3 ...
$ toilets: Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 2 1 1 2 2    ...
$ fireplaces: Factor w/ 6 levels "0","1","2","3",..: 2 2 2 2 1 1 1 2 2  2 ...
$ beds: Factor w/ 7 levels "1","2","3","4",..: 4 5 2 2 2 3 2 2 3 5 ...
$ rooms: Factor w/ 15 levels "0","1","2","3",..: 5 5 5 4 5 3 3 3 4 6 ...
$ age: int  103 17 13 46 116 12 93 93 42 100 ...
$ yearsfromsale: Factor w/ 3 levels "2","3","4": 2 2 2 1 2 2 3 3 1 1 ...
$ car: Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 3 1 1 4 1 ...
$ city_DES.MOINES: Factor w/ 2 levels "0","1": 2 1 1 2 2 1 2 2 2 2 ...
$ city_JOHNSTON: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
$ city_WEST.DES.MOINES: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_CLIVE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_URBANDALE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_ALTOONA: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ city_BONDURANT: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_CROCKER.TWNSHP: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_GRIMES: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_POLK.CITY: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_PLEASANT.HILL: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ city_WINDSOR.HEIGHTS: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50315: Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
$ zip_50321: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
$ zip_50320: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50312: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
$ zip_50314: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50311: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50309: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50316: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
$ zip_50317: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50313: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ zip_50310: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50322: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50131: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
$ zip_50111: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ zip_50265: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50266: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50325: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50323: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50009: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ zip_50035: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50023: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50226: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50021: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50327: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ zip_50324: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ walkout_0: Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 2 2 2 ...
$ walkout_1: Factor w/ 2 levels "0","1": 1 2 1 1 1 2 1 1 1 1 ...
$ condition_Normal: Factor w/ 2 levels "0","1": 1 2 2 1 1 2 1 1 1 1 ...
$ condition_Above.Normal: Factor w/ 2 levels "0","1": 2 1 1 2 2 1 2 1 1 2 ...
$ condition_Below.Normal: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
$ AC_1: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...

When trying to run the lasso_model line, this is the error that I am getting:

Error in cbind2(1, newx) %*% nbeta : 
invalid class 'NA' to  dup_mMatrix_as_dgeMatrix

Essentially, I want to be able to identify which variables to remove. Any help would be great!

desertnaut
rmahesh
  • Do your data contain `NA` values? – liori Sep 09 '18 at 16:12
  • @liori That's the thing I'm confused with, there are no missing values in this dataset. – rmahesh Sep 09 '18 at 16:13
  • Code that comes *after* the line producing the error is obviously irrelevant to the issue, and it is strongly recommended not to include it in the question as it just creates clutter (edited & removed) – desertnaut Sep 09 '18 at 17:38

1 Answer

OK, I have a strong suspicion.

You have factors in your data frame. as.matrix converts them to strings, not numbers, so the whole matrix becomes a character matrix, and glmnet doesn't know what to do with it:

> df <- data.frame(a=as.factor(c('0', '1', '2')), b=as.factor(c('0', '0', '1')))
> df
  a b
1 0 0
2 1 0
3 2 1
> as.matrix(df)
     a   b  
[1,] "0" "0"
[2,] "1" "0"
[3,] "2" "1"

Try converting them explicitly back to numbers (a somewhat roundabout way, but it should work):

> as.matrix(data.frame(lapply(df, function(x) as.numeric(as.character(x)))))
     a b
[1,] 0 0
[2,] 1 0
[3,] 2 1
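
Applied to a data frame like the one in the question, which mixes numeric, integer, and factor columns, the same idea might look like the sketch below. The helper name `factors_to_numeric` and the toy data are made up for illustration, and the helper assumes every factor's labels are numeric strings such as "0", "1", "2". The sketch also pulls the target out of its single-column data frame, on the assumption that `y` should be a plain numeric vector.

```r
# Toy stand-in for the question's `train`: numeric, integer, and factor columns.
train <- data.frame(
  acres = c(0.186, 2.991, 0.144),
  above = c(1754L, 3030L, 1531L),
  baths = factor(c("2", "3", "2")),
  AC_1  = factor(c("1", "1", "0"))
)
# Toy stand-in for the one-column target data frame (the column name is made up).
train_target <- data.frame(price = c(150000, 420000, 135000))

# Hypothetical helper: turn factor columns into the numbers their labels spell,
# leaving all other columns untouched. Assumes all labels are numeric strings.
factors_to_numeric <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.numeric(as.character(col)) else col
  })
  df
}

x <- as.matrix(factors_to_numeric(train))  # numeric matrix, not character
y <- train_target[[1]]                     # plain numeric vector

is.numeric(x)                 # TRUE
is.numeric(as.matrix(train))  # FALSE: the factor columns force a character matrix
```

With the real data, `x` and `y` built this way could then be passed to `cv.glmnet(x = x, y = y, alpha = 1)`.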
liori
  • I am quite new to R syntax, so please forgive me. You are recommending that the factors be converted to integers before running the lasso? My only concern is that I don't want the lasso model to think that, for the ordinal categories, 2 has precedence over 1, as they are both categories. Is this a problem that I will run into? – rmahesh Sep 09 '18 at 16:37
  • The glmnet package simply does not have any special treatment for factor variables. If you want some special treatment, like one-hot encoding, you have to implement it yourself… or use some wrapper over the `glmnet` package, such as https://cran.r-project.org/web/packages/glmnetUtils/vignettes/intro.html or the `caret` package. – liori Sep 09 '18 at 16:40
  • Would the feature selection be incorrect if I did not do any special treatment for the factor variables? I've tried implementing glmnet but am quite confused by much of the syntax, and I ran into some errors I was unable to resolve. This is the closest I was able to get using this package. – rmahesh Sep 09 '18 at 16:42
  • For factors with just two levels, not really, and most of your variables are two-level only. Their one-hot encoding would lead to essentially the same model. The other variables seem to have a natural ordering (e.g. 1 bedroom < 2 bedrooms < 3 bedrooms, etc.), so it wouldn't be very bad either: it probably breaks the assumption of linear relationship between the number of bedrooms and whatever your target variable is, but in some situations it's still "good enough". Though, I believe the `glmnetUtils` package should be simple enough for your purposes. – liori Sep 09 '18 at 16:49
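
For the "implement it yourself" route mentioned in the comments, one possible sketch uses base R's `model.matrix` to expand factors into 0/1 indicator columns before calling `cv.glmnet`, then reads off which coefficients the lasso left nonzero. The data here are simulated and the column names (`beds`, `sqft`, `noise`) are invented for illustration; the choice of `lambda.1se` is also just an assumption, and `lambda.min` is the other common option.

```r
library(glmnet)

set.seed(1)
n <- 200

# Simulated predictors: one multi-level factor and two numeric columns.
df <- data.frame(
  beds  = factor(sample(1:4, n, replace = TRUE)),
  sqft  = rnorm(n),
  noise = rnorm(n)
)
# Simulated target: depends on sqft and beds, not on noise.
y <- 3 * df$sqft + as.integer(as.character(df$beds)) + rnorm(n, sd = 0.5)

# Expand factors into indicator columns (treatment contrasts: one column per
# level except the first), then drop the intercept column that model.matrix adds.
x <- model.matrix(~ ., data = df)[, -1]
colnames(x)  # "beds2" "beds3" "beds4" "sqft" "noise"

# Cross-validated lasso (alpha = 1).
fit <- cv.glmnet(x = x, y = y, alpha = 1)

# Coefficients at the chosen lambda; entries that are exactly zero were dropped.
cf       <- as.matrix(coef(fit, s = "lambda.1se"))
selected <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
dropped  <- setdiff(colnames(x), selected)
selected
dropped
```

Because each level of a factor gets its own column, the lasso can keep some levels and drop others, so a factor may end up only partly selected.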