
I am working on a regression project in sklearn where I used LASSO regression on a variety of numeric and categorical variables. The categorical variables were transformed using the One-hot-encoder method.

Since the feature matrix was normalized beforehand, the absolute values of the coefficients in the final LASSO model should indicate the relative importance of the features.

However, I cannot figure out how to compare importance between a numeric variable and a categorical variable. For example (predicting housing price from square footage and household type):

Feature         Coefficient
sqft             114.35
type_house       67.11
type_apartment   -23.97
type_condo       5.14

What should be a reasonable way to compare the importance of sqft and type?

Xiaoyu Lu

1 Answer


LASSO performs feature selection, and you can see it at work by estimating the model over a range of values of λ (the penalty coefficient). Plot the estimated coefficients on the y-axis against λ on the x-axis. This shows how the importance of each variable changes as the regularisation penalty increases.

Here you will find a more detailed description (picture's source). In the plot, wt is one of the most important variables: even when the penalty λ is high (> 1), its coefficient is still different from zero.

[Figure: Variable importance and lambda]
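As a rough sketch of the coefficient path described above, the snippet below uses scikit-learn's `lasso_path` on synthetic data. The data, the variable names, and the summary dictionary `last_alpha` are assumptions made for illustration, not the asker's dataset. Note that scikit-learn calls the penalty `alpha` rather than λ.

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic data (an assumption for illustration only): one numeric
# column plus three one-hot columns for a categorical "type".
rng = np.random.default_rng(0)
n = 300
sqft = rng.normal(size=n)
kind = rng.integers(0, 3, size=n)
onehot = np.eye(3)[kind]
X = np.column_stack([sqft, onehot])
names = ["sqft", "type_house", "type_apartment", "type_condo"]
y = 3.0 * sqft + onehot @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=0.5, size=n)

# lasso_path fits no intercept, so standardize the columns and center y.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

# alphas is sorted from largest to smallest penalty;
# coefs has shape (n_features, n_alphas).
alphas, coefs, _ = lasso_path(X, y)

# For each feature, the largest penalty at which its coefficient is
# still nonzero. Features that survive a larger penalty enter the model
# earlier, which is the reading of the plot suggested above.
nonzero = np.abs(coefs) > 1e-12
last_alpha = {
    name: (float(alphas[mask].max()) if mask.any() else 0.0)
    for name, mask in zip(names, nonzero)
}

for name, a in sorted(last_alpha.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} nonzero up to alpha = {a:.4f}")
```

Each row of `coefs` can be plotted against `alphas` (for example on a log-scaled x-axis) to reproduce this kind of figure. The path still treats each one-hot column as a separate feature.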

An economist
  • Sorry, but this does not address my question. It is useful for comparing the importance of individual variables, but when several variables are encoded values of the same categorical variable, it gives no information about their combined importance. – Xiaoyu Lu Oct 12 '18 at 19:02
  • For combined importance, the usual approach is an F-test. – An economist Oct 13 '18 at 02:22
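One way to read the F-test suggestion is as a nested-model comparison: fit an ordinary least squares model with and without the dummy columns of the categorical variable and test whether the group as a whole reduces the residual sum of squares. The sketch below assumes plain OLS rather than LASSO, uses synthetic data, and introduces the helper `rss` purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 300
sqft = rng.normal(size=n)
kind = rng.integers(0, 3, size=n)
onehot = np.eye(3)[kind]
y = 3.0 * sqft + onehot @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=0.5, size=n)


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


ones = np.ones(n)
# Restricted model: intercept + sqft only.
X_restricted = np.column_stack([ones, sqft])
# Full model: also includes the "type" dummies. One level is dropped
# to avoid perfect collinearity with the intercept.
X_full = np.column_stack([ones, sqft, onehot[:, 1:]])

q = X_full.shape[1] - X_restricted.shape[1]   # number of restrictions tested
df_resid = n - X_full.shape[1]                # residual degrees of freedom

rss_r, rss_f = rss(X_restricted, y), rss(X_full, y)
f_stat = ((rss_r - rss_f) / q) / (rss_f / df_resid)
p_value = stats.f.sf(f_stat, q, df_resid)

print(f"F({q}, {df_resid}) = {f_stat:.2f}, p = {p_value:.3g}")
```

A small p-value suggests that the categorical variable as a whole adds explanatory power beyond `sqft`. Because this test rests on unpenalized OLS, it does not carry over directly to LASSO coefficients.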