
According to Machine Learning Mastery, we should use sklearn.feature_selection.chi2 or sklearn.feature_selection.mutual_info_classif when both X and y are nominal.

However, neither chi2 nor mutual_info_classif accepts a nominal X. Both expect a numeric X.

I am confused about whether to encode the nominal X with OrdinalEncoder, LabelEncoder, or OneHotEncoder. My gut tells me to use OneHotEncoder, because nominal categories have no ordering of the kind that codes like 0, 1, 2 would imply.

However, the result looks like gibberish to me, and I don't understand why 1 column turns into multiple mutual_info values. So I believe OneHotEncoder is not supposed to be used with mutual_info_classif.

What I've tried:

  1. Use OneHotEncoder
mutual_info_classif(OneHotEncoder().fit_transform(X_train['workclass'].values.reshape(-1,1)), y_train)

array([1.57589114e-03, 5.67522742e-04, 6.98987129e-05, 6.94975390e-03, 8.32101543e-03, 2.74923178e-04, 3.14836186e-05, 1.51472199e-04])

  1. Use OrdinalEncoder
mutual_info_classif(MinMaxScaler().fit_transform(OrdinalEncoder().fit_transform(X_train['workclass'].values.reshape(-1,1))), y_train)

array([0.01156094])

  1. Use LabelEncoder
mutual_info_classif(LabelEncoder().fit_transform(X_train['workclass']).reshape(-1,1), y_train)

array([0.0160716])
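The integer-coded attempts above use the default discrete_features='auto', which treats a dense X as continuous and estimates MI with a nearest-neighbour method. That may be why the OrdinalEncoder and LabelEncoder results differ even though the codes carry the same information. A minimal sketch of the alternative, passing discrete_features=True, again on hypothetical toy data (the category names and random labels are assumptions, not the original dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

# Hypothetical toy data standing in for X_train['workclass'] and y_train.
rng = np.random.default_rng(0)
workclass = rng.choice(["Private", "Self-emp", "Gov"], size=200).reshape(-1, 1)
y = rng.integers(0, 2, size=200)

# Integer codes 0, 1, 2; the order is arbitrary for a nominal feature.
codes = OrdinalEncoder().fit_transform(workclass).astype(int)

# Treat the codes as category labels, not as numeric magnitudes.
mi_discrete = mutual_info_classif(codes, y, discrete_features=True)

# Relabel the categories (0->2, 1->0, 2->1) to check that the order is irrelevant.
permuted = np.array([2, 0, 1])[codes]
mi_permuted = mutual_info_classif(permuted, y, discrete_features=True)

# For a discrete feature and a discrete target, the same quantity can be
# computed directly from the contingency table.
mi_direct = mutual_info_score(codes.ravel(), y)

print(mi_discrete, mi_permuted, mi_direct)
```

With discrete_features=True there is one MI value per original column, and it should not depend on which integer each category happens to receive.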

  1. Use OneHotEncoder again, assume that each column_value pair with a high MI is informative, then choose the top columns as features.
enc = OneHotEncoder()
mi = mutual_info_classif(enc.fit_transform(X_train[nominal]), y_train)
pd.DataFrame({'mi': mi}, index=enc.get_feature_names_out()) \
  .sort_values(by='mi', ascending=False)

                                              mi
marital_status_ Married-civ-spouse  1.070108e-01
relationship_ Husband               8.200896e-02
marital_status_ Never-married       6.353818e-02
relationship_ Own-child             3.674731e-02
sex_ Male                           2.601720e-02
...                                          ...

Jason Rich Darmawan