3

I want to perform multiple linear regression in Python with lasso. I am not sure whether the input observation matrix X can contain categorical variables. I read the instructions here: lasso in python

But they are brief and do not say which variable types are allowed. For example, my code includes:

from sklearn.linear_model import Lasso

model = Lasso(fit_intercept=False, alpha=0.01)
model.fit(X, y)

In the code above, X is an n-by-p observation matrix. Can one of the p variables be categorical?

emberbillow
  • No. This is not specific to Lasso: scikit-learn estimators in general expect numeric input in fit() and predict(), so categorical variables have to be encoded first – Shihab Shahriar Khan Nov 16 '19 at 07:17
  • If you want to use categorical features, try encoding them as numerical values with the `OneHotEncoder` class from `sklearn.preprocessing`: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html – Parthasarathy Subburaj Nov 16 '19 at 08:26

2 Answers

1

You need to represent the categorical variables using 1s and 0s. If a categorical variable is binary, meaning each observation belongs to one of two categories, replace category A with 0 and category B with 1. If a variable has more than two categories, you will need to use dummy variables, that is, one 0/1 column per category.

I usually have my data in a pandas DataFrame, in which case I use `houses = pd.get_dummies(houses)`, which creates the dummy variables.
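As a rough sketch of both cases, assuming a small made-up `houses` table (the column names `area`, `garage`, `district` and all values are hypothetical, not from the question), the binary column can be mapped by hand and the three-level column expanded with `pd.get_dummies`:

```python
import pandas as pd
from sklearn.linear_model import Lasso

# Hypothetical toy data: one numeric, one binary and one three-level column.
houses = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0, 95.0, 140.0],
    "garage": ["no", "yes", "yes", "no", "no", "yes"],
    "district": ["north", "south", "east", "north", "east", "south"],
})
y = [150.0, 240.0, 390.0, 190.0, 300.0, 430.0]

# Binary variable: map the two categories to 0 and 1 directly.
houses["garage"] = houses["garage"].map({"no": 0, "yes": 1})

# Multi-level variable: one 0/1 dummy column per category.
X = pd.get_dummies(houses, columns=["district"], dtype=float)

# Same estimator settings as in the question; max_iter is raised only
# to give the solver room on unscaled toy data.
model = Lasso(fit_intercept=False, alpha=0.01, max_iter=10000)
model.fit(X, y)

print(list(X.columns))
print(model.coef_)
```

After encoding, every column of `X` is numeric, which is what `Lasso.fit` expects.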

1

A previous poster has a good answer for this: you need to encode your categorical variables. The standard way is one-hot encoding (or dummy encoding), but there are many other methods.

Here is a good library, category_encoders, that implements many different ways of encoding categorical variables. The encoders are also designed to work with scikit-learn.
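If you would rather keep the encoding inside scikit-learn, one possible setup is to put `OneHotEncoder` in a `ColumnTransformer` and chain it with `Lasso` in a pipeline. This is only a sketch under assumptions: the DataFrame, its column names and the values below are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "area" is numeric, "district" is categorical.
X = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0, 95.0, 140.0],
    "district": ["north", "south", "east", "north", "east", "south"],
})
y = [150.0, 240.0, 390.0, 190.0, 300.0, 430.0]

# One-hot encode only the categorical column; pass the numeric one through.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["district"])],
    remainder="passthrough",
)

# The pipeline applies the same encoding in fit() and predict().
pipeline = make_pipeline(
    encode,
    Lasso(fit_intercept=False, alpha=0.01, max_iter=10000),
)
pipeline.fit(X, y)

X_new = pd.DataFrame({"area": [100.0], "district": ["east"]})
pred = pipeline.predict(X_new)
print(pred)
```

The advantage over encoding by hand is that the categories learned during `fit` are reused at prediction time, so new data always ends up with the same columns.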

https://contrib.scikit-learn.org/categorical-encoding/

jawsem