
I'm working with a fairly simple dataset that has missing values in both categorical and numeric features. Because of this, I'm trying to use sklearn.impute.KNNImputer to get the most accurate imputation I can. However, when I run the following code:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=120)
imputer.fit_transform(x_train)

I get the error: ValueError: could not convert string to float: 'Private'

That makes sense; it obviously can't handle categorical data. But when I try to run OneHotEncoder with:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop="first")
encoder.fit_transform(x_train[categorical_features])

It throws the error: ValueError: Input contains NaN

I'd prefer to use KNNImputer even for the categorical data, since I feel I'd lose some accuracy if I used a ColumnTransformer and imputed the numeric and categorical data separately. Is there any way to get OneHotEncoder to ignore these missing values? If not, is using a ColumnTransformer or a simpler imputer a better way of tackling this problem?
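For context, the separate-imputation fallback I'm referring to would look something like this sketch (the column names here are just placeholders for my real features):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

# Toy frame standing in for my real x_train.
x_train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "hours": [40.0, 38.0, 45.0, np.nan],
    "workclass": ["Private", "Private", np.nan, "State-gov"],
})

# Numeric columns go through KNNImputer, categorical columns
# through a simpler most-frequent imputer -- entirely separately.
ct = ColumnTransformer([
    ("num", KNNImputer(n_neighbors=2), ["age", "hours"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["workclass"]),
])
filled = ct.fit_transform(x_train)
```

This runs without errors, but the categorical imputation never sees the numeric features, which is exactly the information loss I'm worried about.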

Thanks in advance

  • Using `KNNImputer` to fill one-hot-encoded categoricals will generally produce values between 0 and 1 (unless `n_neighbors=1`); is that fine for your use? And if so, how would you want to encode missing values: as a new column, or as all-zeros? – Ben Reiniger Jul 13 '20 at 15:24
  • @BenReiniger That should be fine as long as I just round the value between 0 and 1, right? For the second part of your question, I think defaulting to 0s for missing categorical variables would be best. However, I was hoping there was a way to ignore these NAs in `OneHotEncoder()` so I could use `KNNImputer` on those NAs in the categorical variables as well. Is there no real way to do that? – DuplicitousManowar Jul 13 '20 at 16:06

1 Answer


There are open issues/PRs to handle missing values on OneHotEncoder, but it's not clear yet what the options would be. In the interim, here's a manual approach.

  • Fill the categorical features' missing values with the string "missing", using pandas or SimpleImputer(strategy="constant").
  • Then apply OneHotEncoder.
  • Use the one-hot encoder's get_feature_names to identify the columns corresponding to each original feature, and in particular the "missing" indicator.
  • For each row and each original categorical feature, when the 1 is in the "missing" column, replace the 0's with np.nan; then delete the missing indicator column.
  • Now everything should be set up to run KNNImputer.
  • Finally, if desired, postprocess the imputed categorical-encoding columns. (Simply rounding might get you an all-zeros row for a categorical feature, but I don't think with KNNImputer you could get more than one 1 in a row. You could argmax instead to get back exactly one 1.)
Ben Reiniger
  • Thanks man, this really helped out a ton appreciate it – DuplicitousManowar Jul 18 '20 at 19:28
  • Can we use `strategy="most_frequent"` for `SimpleImputer` instead of replacing the missing values with the "missing" string? But it will return an array without column names, which causes problems for `OneHotEncoder`. – curiouscheese May 25 '22 at 13:13
  • @xkderhaka That won't do what the Question asks for; you won't be able to distinguish between the missing values and the most frequent values, and the `KNNImputer` won't be able to do anything for the categorical features' missing values. – Ben Reiniger May 25 '22 at 14:19
  • I mean, as I understand it, if we use `strategy="most_frequent"`, it will replace each `NaN` with the most frequent value, right? After that there are no more `NaN`s in the dataframe (they've been replaced by the most frequent value). If there are no more NaNs, we can continue doing OHE, can't we? – curiouscheese May 26 '22 at 02:40