How to do feature selection if X is nominal and y is nominal as well?

Question

According to Machine Learning Mastery, we should use sklearn.feature_selection.chi2 or sklearn.feature_selection.mutual_info_classif when both X and y are nominal.

However, neither chi2 nor mutual_info_classif accepts a nominal X. Both expect a numeric X.

I am unsure whether to encode the nominal X with OrdinalEncoder, LabelEncoder, or OneHotEncoder. My gut tells me to use OneHotEncoder, because nominal categories have no ordering of the kind that codes like 0, 1, 2 would imply.

However, the result is hard to interpret, and I don't understand why one column turns into several mutual_info values (eight in the output below). So I suspect OneHotEncoder is not supposed to be used with mutual_info_classif.

What I've tried:

Use OneHotEncoder

mutual_info_classif(OneHotEncoder().fit_transform(X_train['workclass'].values.reshape(-1,1)), y_train)

array([1.57589114e-03, 5.67522742e-04, 6.98987129e-05, 6.94975390e-03, 8.32101543e-03, 2.74923178e-04, 3.14836186e-05, 1.51472199e-04])

Use OrdinalEncoder

mutual_info_classif(MinMaxScaler().fit_transform(OrdinalEncoder().fit_transform(X_train['workclass'].values.reshape(-1,1))), y_train)

array([0.01156094])

Use LabelEncoder

mutual_info_classif(LabelEncoder().fit_transform(X_train['workclass']).reshape(-1,1), y_train)

array([0.0160716])
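One thing that may matter for the two integer-coded runs: with the default discrete_features='auto', mutual_info_classif treats a dense X as continuous and uses a nearest-neighbour estimator with a little random noise, so the codes are read as magnitudes. Passing discrete_features=True makes it treat the codes as category labels instead. A hedged sketch on hypothetical toy data (the names x, y, codes are illustrative, not from the dataset above):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
# Hypothetical toy data: one nominal feature and a binary target (0/1).
x = rng.choice(["Private", "State-gov", "Self-emp"], size=200)
y = ((x == "Private") & (rng.random(200) < 0.8)).astype(int)

# Integer codes 0, 1, 2 (as floats), one per category.
codes = OrdinalEncoder().fit_transform(x.reshape(-1, 1))

# discrete_features=True: codes are category labels, not magnitudes.
mi_discrete = mutual_info_classif(codes, y, discrete_features=True)

# Relabelling the categories leaves the discrete estimate unchanged.
perm = np.array([2, 0, 1])
shuffled = perm[codes.astype(int)]
mi_shuffled = mutual_info_classif(shuffled, y, discrete_features=True)

# Same quantity as the contingency-table estimate in sklearn.metrics.
mi_reference = mutual_info_score(codes.ravel(), y)

print(mi_discrete, mi_shuffled, mi_reference)
```

Under this setting the choice of integer codes should not affect the score, which gives one value per original column without implying an order among the categories.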

Use OneHotEncoder again on all nominal columns, treat each one-hot column (feature_value) as a separate feature, and choose the columns with the highest MI as features.

enc = OneHotEncoder()
mi = mutual_info_classif(enc.fit_transform(X_train[nominal]), y_train)
pd.DataFrame({'mi': mi}, index=enc.get_feature_names_out()) \
  .sort_values(by='mi', ascending=False)

                                               mi
marital_status_ Married-civ-spouse   1.070108e-01
relationship_ Husband                8.200896e-02
marital_status_ Never-married        6.353818e-02
relationship_ Own-child              3.674731e-02
sex_ Male                            2.601720e-02
...                                           ...
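For comparison, a score per original feature (rather than per category) can be obtained without any encoding by running a chi-squared test of independence on each feature's contingency table with the target. This sketch swaps in pandas.crosstab and scipy.stats.chi2_contingency instead of sklearn's chi2, and the toy frame and column names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 500
# Hypothetical toy frame: 'sex' influences the target, 'noise' does not.
df = pd.DataFrame({
    "sex": rng.choice(["Male", "Female"], size=n),
    "noise": rng.choice(["a", "b", "c"], size=n),
})
p_pos = np.where(df["sex"] == "Male", 0.7, 0.2)
target = pd.Series(rng.random(n) < p_pos, name="income_high")

rows = []
for col in df.columns:
    # Category-by-class count table for this feature.
    table = pd.crosstab(df[col], target)
    stat, p_value, dof, _ = chi2_contingency(table)
    rows.append({"feature": col, "chi2": stat, "dof": dof, "p_value": p_value})

# Smallest p-value first: strongest evidence of dependence on the target.
scores = pd.DataFrame(rows).set_index("feature").sort_values("p_value")
print(scores)
```

Because features with more categories have more degrees of freedom, ranking by p-value rather than by the raw statistic is probably the fairer comparison here.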

So, how should I do feature selection when both X and y are nominal?
