Feature selection and categorical variables

Question

I am working on a dataset that contains mainly binary variables. However, two of them are categorical with multiple values (strings). I want to apply feature selection using lasso, but I get the error ValueError: could not convert string to float:

Should I use LabelEncoder and then do the feature selection? Any ideas on how to deal with this?

Here is my code

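from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import MinMaxScaler

# Features are all columns but the last; the target is the last column
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Scale the features to [0, 1], then select features with a cross-validated lasso
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
selector = SelectFromModel(estimator=LassoCV(cv=5)).fit(X_scaled, y)
selector.get_support()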

score 1 · Accepted Answer · answered Jan 29 '21 at 12:09

Plain one-hot encoding is problematic here: each level of a category becomes its own binary column, so lasso shrinks the levels independently and cannot select or drop the categorical variable as a whole, which is what you are after, I guess. You can also check out this post.
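To make that concrete, here is a minimal sketch on made-up data (the names `demo`, `y_demo`, and `X_demo`, and the choice `alpha=0.1`, are illustrative assumptions, not part of the question). An ordinary lasso is fitted on one-hot columns, and the number of surviving levels is counted per original variable:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy data: two categorical variables with three levels each
demo = pd.DataFrame({
    "cat1": rng.choice(["A", "B", "C"], n),
    "cat2": rng.choice(["D", "E", "F"], n),
})
# Only level 'A' of cat1 carries signal; cat2 is pure noise
y_demo = 2 * (demo["cat1"] == "A").astype(int) + rng.normal(0, 1, n)

# One 0/1 column per level: cat1_A, cat1_B, cat1_C, cat2_D, ...
X_demo = pd.get_dummies(demo).astype(float)

# Plain lasso penalises every dummy column separately
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
coefs = pd.Series(lasso.coef_, index=X_demo.columns)
print(coefs)

# Count how many levels of each original variable kept a nonzero coefficient
kept_per_variable = (coefs != 0).groupby(lambda name: name.split("_")[0]).sum()
print(kept_per_variable)
```

Whenever a variable keeps some levels but not all of them, the selection is per dummy column rather than per variable, which is the situation a group penalty is meant to avoid.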

You can use the group lasso implementation in Python (the group_lasso package imported below). Below I use an example dataset:

import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups

import scipy.sparse

data = pd.DataFrame({'cat1':np.random.choice(['A','B','C'],100),
                    'cat2':np.random.choice(['D','E','F'],100),
                    'bin1':np.random.choice([0,1],100),
                    'bin2':np.random.choice([0,1],100)})

data['y'] = 1.5*data['bin1'] + -3*data['bin2'] + 2*(data['cat1'] == 'A').astype('int') + np.random.normal(0,1,100)

Define the categorical and numeric (binary) columns. You don't need the min-max scaler since your values are already binary (0/1). Next, we one-hot encode the categorical columns and extract the group labels:

cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']

ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)

Put the numeric and one-hot columns together in one sparse matrix. You can also convert them to dense, but that can be problematic if the data is huge:

X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']

Likewise, extend the group vector so that each numeric column gets its own group:

groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
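For reference, a NumPy-only sketch of what this group vector looks like for the example data. It assumes that `extract_ohe_groups` labels the one-hot columns of the first categorical variable 0 and those of the second 1, which matches the chosen groups shown further down; `n_levels` and `n_numeric` are illustrative names:

```python
import numpy as np

n_levels = [3, 3]   # levels of cat1 and cat2 in the example data
n_numeric = 2       # bin1 and bin2

# One label per one-hot column, repeated for every level of the variable
ohe_groups = np.repeat(np.arange(len(n_levels)), n_levels)

# Same expression as in the answer: each numeric column is its own group
numeric_groups = len(n_levels) + np.arange(n_numeric) + 1

groups = np.hstack([ohe_groups, numeric_groups])
print(groups)  # [0 0 0 1 1 1 3 4]
```

Because of the `+1`, label 2 is simply unused. What matters is that columns belonging to the same variable share a label and different variables get different labels.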

Run the group lasso:

grpLasso = GroupLasso(groups=groups,supress_warning=True,n_iter=1000)
grpLasso.fit(X,y)

grpLasso.sparsity_mask_
array([ True,  True,  True, False, False, False,  True,  True])

grpLasso.chosen_groups_
{0, 3, 4}
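To read the result in terms of the original variables, here is a small sketch that maps the mask back to names. The mask and group vector are copied from the output above, and the column order (one-hot columns first, then the binary columns) is assumed from how `X` was stacked; `variable_of_column` is an illustrative helper, not part of group_lasso:

```python
import numpy as np

# Original variable behind each column of X, in the stacking order used above
variable_of_column = np.array(["cat1"] * 3 + ["cat2"] * 3 + ["bin1", "bin2"])
groups = np.array([0, 0, 0, 1, 1, 1, 3, 4])

# Mask copied from the sparsity_mask_ output shown above
sparsity_mask = np.array([True, True, True, False, False, False, True, True])

# Unique variables with at least one selected column, in column order
selected_variables = list(dict.fromkeys(variable_of_column[sparsity_mask].tolist()))
print(selected_variables)  # ['cat1', 'bin1', 'bin2']

# The group labels of the selected columns agree with chosen_groups_
print(set(groups[sparsity_mask].tolist()))  # {0, 3, 4}
```

So cat1 is kept as a whole (all three of its levels), cat2 is dropped as a whole, and both binary columns are kept, which agrees with how `y` was simulated.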

Also check out the help page for using it in a pipeline.
