According to Machine Learning Mastery, we should use sklearn.feature_selection.chi2 or sklearn.feature_selection.mutual_info_classif when both X and y are nominal.
However, neither chi2 nor mutual_info_classif accepts a nominal X. Both expect a numeric X.
I am confused about whether to encode the nominal X with OrdinalEncoder, LabelEncoder, or OneHotEncoder. My gut tells me to use OneHotEncoder, because nominal categories have no order, so codes like 0, 1, 2 would imply a relationship that does not exist.
However, the result looks like gibberish to me, and I don't understand why one column turns into multiple mutual_info values. So I suspect OneHotEncoder is not supposed to be used with mutual_info_classif.
What I've tried:
- Use OneHotEncoder
mutual_info_classif(OneHotEncoder().fit_transform(X_train['workclass'].values.reshape(-1,1)), y_train)
array([1.57589114e-03, 5.67522742e-04, 6.98987129e-05, 6.94975390e-03, 8.32101543e-03, 2.74923178e-04, 3.14836186e-05, 1.51472199e-04])
- Use OrdinalEncoder
mutual_info_classif(MinMaxScaler().fit_transform(OrdinalEncoder().fit_transform(X_train['workclass'].values.reshape(-1,1))), y_train)
array([0.01156094])
- Use LabelEncoder
mutual_info_classif(LabelEncoder().fit_transform(X_train['workclass']).reshape(-1,1), y_train)
array([0.0160716])
- Use OneHotEncoder again, assume that a column_value pair with high MI is informative, and choose the top ones as features.
enc = OneHotEncoder()
mi = mutual_info_classif(enc.fit_transform(X_train[nominal]), y_train)
pd.DataFrame({'mi': mi}, index=enc.get_feature_names_out()) \
.sort_values(by='mi', ascending=False)
marital_status_ Married-civ-spouse    1.070108e-01
relationship_ Husband                 8.200896e-02
marital_status_ Never-married         6.353818e-02
relationship_ Own-child               3.674731e-02
sex_ Male                             2.601720e-02
...                                            ...
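One thing that may be relevant to the OrdinalEncoder and LabelEncoder attempts: mutual_info_classif has a discrete_features parameter, and with its default of 'auto' a dense array is treated as continuous while a sparse matrix is treated as discrete. The sketch below, on made-up data (the category names and the noise level are assumptions), passes discrete_features=True explicitly and checks whether the score changes when the integer codes are shuffled:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)

# Made-up nominal feature and a target that depends on it, with 20% noise.
categories = np.array(["Private", "State-gov", "Self-emp"])
x = rng.choice(categories, size=300).reshape(-1, 1)
y = ((x.ravel() == "Private") ^ (rng.random(300) < 0.2)).astype(int)

# Integer codes 0, 1, 2 assigned by OrdinalEncoder.
codes = OrdinalEncoder().fit_transform(x).astype(int)

# Same feature with the codes permuted: 0 -> 1, 1 -> 2, 2 -> 0.
permuted = (codes + 1) % 3

# Tell the estimator the column is discrete, so the codes act as labels only.
mi_a = mutual_info_classif(codes, y, discrete_features=True, random_state=0)
mi_b = mutual_info_classif(permuted, y, discrete_features=True, random_state=0)

print(mi_a, mi_b)
```

If the two scores agree, the arbitrary order implied by 0, 1, 2 is not affecting the result in this setting, and there is a single score for the whole column.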